pmid,title,year,journal,doi,mesh,keywords,abstract,authors
38534904,ChatGPT's Response Consistency: A Study on Repeated Queries of Medical Examination Questions.,2024,"European journal of investigation in health, psychology and education",,,,"(1) Background: As the field of artificial intelligence (AI) evolves, tools like ChatGPT are increasingly integrated into various domains of medicine, including medical education and research. Given the critical nature of medicine, it is of paramount importance that AI tools offer a high degree of reliability in the information they provide. (2) Methods: A total of n = 450 medical examination questions were manually entered into ChatGPT thrice, each for ChatGPT 3.5 and ChatGPT 4. The responses were collected, and their accuracy and consistency were statistically analyzed throughout the series of entries. (3) Results: ChatGPT 4 displayed a statistically significantly improved accuracy with 85.7% compared to that of 57.7% of ChatGPT 3.5 (p < 0.001). Furthermore, ChatGPT 4 was more consistent, correctly answering 77.8% across all rounds, a significant increase from the 44.9% observed from ChatGPT 3.5 (p < 0.001). (4) Conclusions: The findings underscore the increased accuracy and dependability of ChatGPT 4 in the context of medical education and potential clinical decision making.
Nonetheless, the research emphasizes the indispensable nature of human-delivered healthcare and the vital role of continuous assessment in leveraging AI in medicine.",Funk PF; Hoch CC; Knoedler S; Knoedler L; Cotofana S; Sofo G; Bashiri Dezfouli A; Wollenberg B; Guntinas-Lichius O; Alfertshofer M
39310381,Preliminary discrimination and evaluation of clinical application value of ChatGPT4o in bone tumors.,2024,Journal of bone oncology,,,,"*Evaluation of ChatGPT4o in the preliminary pathological diagnosis of bone tumors.*ChatGPT-4o's proficiency in analyzing pathological images and providing initial diagnoses of bone tumor characteristics is comparable to that of senior pathologists in the Tertiary hospital doctors group, with both surpassing the Remote grassroots doctors group.*AI, like ChatGPT-4o, has the potential to enhance diagnostic capabilities for remote grassroots doctors and improve sensitivity to reduce missed diagnosis rates among tertiary hospital doctors in identifying bone tumors.",Huang L; Hu J; Cai Q; Ye A; Chen Y; Yang Xiao-Zhi Z; Liu Y; Zheng J; Meng Z
39465720,"Enhancing English abstract quality for non-English speaking authors using ChatGPT: A comparative study of Taiwan, Japan, China, and South Korea with slope graphs.",2024,Medicine,,,,"A clear and proficient English abstract is crucial for disseminating research findings to a global audience, significantly impacting the accessibility and visibility of research from non-English speaking countries. Despite the adoption of ChatGPT since November 30, 2022, a comprehensive analysis of improvements in English abstracts in scholarly journals has not been conducted. This study aims to identify which authors from Taiwan, Japan, China, and South Korea (TJCS) have shown the most improvement in English abstracts. Article abstracts published in Medicine (Baltimore) sourced from the Web of Science Core Collection from 2020 to 2023 were downloaded.
A mixed-methods approach was employed, combining quantitative analysis of linguistic quality indicators and qualitative assessments of coherence and engagement using the Rasch model. Ten quality indicators were determined by prompting ChatGPT. Two scenarios were analyzed: (1) generative pretrained transformer (GPT) versus non-GPT (each with 30 abstracts from 2021) and (2) TJCS in comparison (each with 100 abstracts from 2021 and 2023, respectively). Standardized mean differences were compared using paired samples t test. Visuals including forest plots, Rasch Wright Map, the slope graph, and scatter plot with 95% control lines were used to examine the 2 scenarios. (1) No significant difference was found between GPT and non-GPT abstracts with Rasch logit scores of 3.31 and 3.17, respectively (P = .42), likely due to small sample size (n = 30); (2) significant difference exists between 2020 and 2023 in each country, and between South Korea and Taiwan in 2020. Among TJCS, Taiwan showed the greatest improvement in English abstract quality post-ChatGPT implementation, followed by Japan, China, and South Korea. The English abstracts in Medicine (Baltimore) have improved, reflecting the tool's positive impact on enhancing technical language. This study demonstrates that ChatGPT can enhance the quality of English abstracts for authors from non-English speaking regions, although the assumption that all authors use ChatGPT is invalid and impractical. 
The findings underscore the value of artificial intelligence tools in academic writing and recommend further investigation into the long-term implications of artificial intelligence integration in scholarly communication.",Chou W; Chow JC
38383555,Large language models streamline automated machine learning for clinical studies.,2024,Nature communications,,,,"A knowledge gap persists between machine learning (ML) developers (e.g., data scientists) and practitioners (e.g., clinicians), hampering the full utilization of ML for clinical data analysis. We investigated the potential of the ChatGPT Advanced Data Analysis (ADA), an extension of GPT-4, to bridge this gap and perform ML analyses efficiently. Real-world clinical datasets and study details from large trials across various medical specialties were presented to ChatGPT ADA without specific guidance. ChatGPT ADA autonomously developed state-of-the-art ML models based on the original study's training data to predict clinical outcomes such as cancer development, cancer progression, disease complications, or biomarkers such as pathogenic gene sequences. Following the re-implementation and optimization of the published models, the head-to-head comparison of the ChatGPT ADA-crafted ML models and their respective manually crafted counterparts revealed no significant differences in traditional performance metrics (p ≥ 0.072). Strikingly, the ChatGPT ADA-crafted ML models often outperformed their counterparts.
In conclusion, ChatGPT ADA offers a promising avenue to democratize ML in medicine by simplifying complex data analyses, yet should enhance, not replace, specialized training and resources, to promote broader applications in medical research and practice.",Tayebi Arasteh S; Han T; Lotfinia M; Kuhl C; Kather JN; Truhn D; Nebelung S
39978844,Clinically Relevant Family Medicine Research: Board Certification Updates.,2024,Journal of the American Board of Family Medicine : JABFM,,,,A new Patient Psychological Safety Scale (PPSS) has potential to address an often-unrecognized problem. Should HbA1c be used to follow diabetes in patients with concurrent sickle cell disease? Are there significant differences resulting from HbA1c point-of-care versus send-off testing? Which treatment for which type of incontinence? Which factors are more predictive of emotional exhaustion for clinicians versus nonclinician staff? Does your office apply fluoride to young children's teeth? Is testosterone deficiency associated with death in older men? How does ChatGPT impact board certification exams? What is the most effective treatment for vasomotor symptoms associated with menopause?,Bowman MA; Seehusen DA; Britz J; Ledford CJW
39569401,Can ChatGPT help patients understand radiopharmaceutical extravasations?,2024,Frontiers in nuclear medicine,,,,"A previously published paper in the official journal of the Society of Nuclear Medicine and Molecular Imaging (SNMMI) concluded that the artificial intelligence chatbot ChatGPT may offer an adequate substitute for nuclear medicine staff informational counseling to patients in an investigated setting of (18)F-FDG PET/CT. To ensure consistency with the previous paper, the author and a team of experts followed a similar methodology and evaluated whether ChatGPT could adequately offer a substitute for nuclear medicine staff informational counseling to patients regarding radiopharmaceutical extravasations.
We asked ChatGPT fifteen questions regarding radiopharmaceutical extravasations. Each question or prompt was queried three times. Using the same evaluation criteria as the previously published paper, the ChatGPT responses were evaluated by two nuclear medicine trained physicians and one nuclear medicine physicist for appropriateness and helpfulness. These evaluators found ChatGPT responses to be either highly appropriate or quite appropriate in 100% of questions and very helpful or quite helpful in 93% of questions. The interobserver agreement among the evaluators, assessed using the Intraclass Correlation Coefficient (ICC), was found to be 0.72, indicating good overall agreement. The evaluators also rated the inconsistency across the three ChatGPT responses for each question and found irrelevant or minor inconsistencies in 87% of questions and some differences relevant to main content in the other 13% of the questions. One physician evaluated the quality of the references listed by ChatGPT as the source material it used in generating its responses. The reference check revealed no AI hallucinations. The evaluator concluded that ChatGPT used fully validated references (appropriate, identifiable, and accessible) to generate responses for eleven of the fifteen questions and used generally available medical and ethical guidelines to generate responses for four questions. Based on these results we concluded that ChatGPT may be a reliable resource for patients interested in radiopharmaceutical extravasations. However, these validated and verified ChatGPT responses differed significantly from official positions and public comments regarding radiopharmaceutical extravasations made by the SNMMI and nuclear medicine staff. 
Since patients are increasingly relying on the internet for information about their medical procedures, the differences need to be addressed.",Alvarez M
38352437,Evaluating ChatGPT's Accuracy in Providing Screening Mammography Recommendations among Older Women: Artificial Intelligence and Cancer Communication.,2024,Research square,,,,"Abstract Objective: The U.S. Preventive Services Task Force (USPSTF) recommends biennial screening mammography through age 74. Guidelines vary as to whether or not they recommended mammography screening to women aged 75 and older. This study aims to determine the ability of ChatGPT to provide appropriate recommendations for breast cancer screening in patients aged 75 years and older. Methods: 12 questions and 4 clinical vignettes addressing fundamental concepts about breast cancer screening and prevention in patients aged 75 years and older were created and asked to ChatGPT three consecutive times to generate 3 sets of responses. The responses were graded by a multi-disciplinary panel of experts in the intersection of breast cancer screening and aging. The responses were graded as 'appropriate', 'inappropriate', or 'unreliable' based on the reviewer's clinical judgment, content of the response, and whether the content was consistent across the three responses. Appropriateness was determined through a majority consensus. Results: The responses generated by ChatGPT were appropriate for 11/17 questions (64%). Three questions were graded as inappropriate (18%) and 2 questions were graded as unreliable (12%). A consensus was not reached on one question (6%) and was graded as no consensus. Conclusions: While recognizing the limitations of ChatGPT, it has potential to provide accurate health care information and could be utilized by healthcare professionals to assist in providing recommendations for breast cancer screening in patients age 75 years and older.
Physician oversight will be necessary, due to the possibility of ChatGPT to provide inappropriate and unreliable responses, and the importance of accuracy in medicine.",Braithwaite D; Karanth SD; Divaker J; Schoenborn N; Lin K; Richman I; Hochhegger B; O'Neill S; Schonberg M
37433676,ChatGPT in Nuclear Medicine Education.,2023,Journal of nuclear medicine technology,,,,"Academic integrity has been challenged by artificial intelligence algorithms in teaching institutions, including those providing nuclear medicine training. The GPT 3.5-powered ChatGPT chatbot released in late November 2022 has emerged as an immediate threat to academic and scientific writing. Methods: Both examinations and written assignments for nuclear medicine courses were tested using ChatGPT. Included was a mix of core theory subjects offered in the second and third years of the nuclear medicine science course. Long-answer-style questions (8 subjects) and calculation-style questions (2 subjects) were included for examinations. ChatGPT was also used to produce responses to authentic writing tasks (6 subjects). ChatGPT responses were evaluated by Turnitin plagiarism-detection software for similarity and artificial intelligence scores, scored against standardized rubrics, and compared with the mean performance of student cohorts. Results: ChatGPT powered by GPT 3.5 performed poorly in the 2 calculation examinations (overall, 31.7% compared with 67.3% for students), with particularly poor performance in complex-style questions. ChatGPT failed each of 6 written tasks (overall, 38.9% compared with 67.2% for students), with worsening performance corresponding to increasing writing and research expectations in the third year. In the 8 examinations, ChatGPT performed better than students for general or early subjects but poorly for advanced and specific subjects (overall, 51% compared with 57.4% for students).
Conclusion: Although ChatGPT poses a risk to academic integrity, its usefulness as a cheating tool can be constrained by higher-order taxonomies. Unfortunately, the constraints to higher-order learning and skill development also undermine potential applications of ChatGPT for enhancing learning. There are several potential applications of ChatGPT for teaching nuclear medicine students.",Currie G; Barry K
37225599,"Academic integrity and artificial intelligence: is ChatGPT hype, hero or heresy?",2023,Seminars in nuclear medicine,,,,"Academic integrity in both higher education and scientific writing has been challenged by developments in artificial intelligence. The limitations associated with algorithms have been largely overcome by the recently released ChatGPT; a chatbot powered by GPT-3.5 capable of producing accurate and human-like responses to questions in real-time. Despite the potential benefits, ChatGPT confronts significant limitations to its usefulness in nuclear medicine and radiology. Most notably, ChatGPT is prone to errors and fabrication of information which poses a risk to professionalism, ethics and integrity. These limitations simultaneously undermine the value of ChatGPT to the user by not producing outcomes at the expected standard. Nonetheless, there are a number of exciting applications of ChatGPT in nuclear medicine across education, clinical and research sectors. Assimilation of ChatGPT into practice requires redefining of norms, and re-engineering of information expectations.",Currie GM
38248809,Personalized Medicine in Urolithiasis: AI Chatbot-Assisted Dietary Management of Oxalate for Kidney Stone Prevention.,2024,Journal of personalized medicine,,,,"Accurate information regarding oxalate levels in foods is essential for managing patients with hyperoxaluria, oxalate nephropathy, or those susceptible to calcium oxalate stones. This study aimed to assess the reliability of chatbots in categorizing foods based on their oxalate content.
We assessed the accuracy of ChatGPT-3.5, ChatGPT-4, Bard AI, and Bing Chat to classify dietary oxalate content per serving into low (<5 mg), moderate (5-8 mg), and high (>8 mg) oxalate content categories. A total of 539 food items were processed through each chatbot. The accuracy was compared between chatbots and stratified by dietary oxalate content categories. Bard AI had the highest accuracy of 84%, followed by Bing (60%), GPT-4 (52%), and GPT-3.5 (49%) (p < 0.001). There was a significant pairwise difference between chatbots, except between GPT-4 and GPT-3.5 (p = 0.30). The accuracy of all the chatbots decreased with a higher degree of dietary oxalate content categories but Bard remained having the highest accuracy, regardless of dietary oxalate content categories. There was considerable variation in the accuracy of AI chatbots for classifying dietary oxalate content. Bard AI consistently showed the highest accuracy, followed by Bing Chat, GPT-4, and GPT-3.5. These results underline the potential of AI in dietary management for at-risk patient groups and the need for enhancements in chatbot algorithms for clinical accuracy.",Aiumtrakul N; Thongprayoon C; Arayangkool C; Vo KB; Wannaphut C; Suppadungsuk S; Krisanapan P; Garcia Valencia OA; Qureshi F; Miao J; Cheungpasitporn W
39300965,Developing large language models to detect adverse drug events in posts on x.,2024,Journal of biopharmaceutical statistics,,,,"Adverse drug events (ADEs) are one of the major causes of hospital admissions and are associated with increased morbidity and mortality. Post-marketing ADE identification is one of the most important phases of drug safety surveillance. Traditionally, data sources for post-marketing surveillance mainly come from spontaneous reporting system such as the Food and Drug Administration Adverse Event Reporting System (FAERS).
Social media data such as posts on X (formerly Twitter) contain rich patient and medication information and could potentially accelerate drug surveillance research. However, ADE information in social media data is usually locked in the text, making it difficult to be employed by traditional statistical approaches. In recent years, large language models (LLMs) have shown promise in many natural language processing tasks. In this study, we developed several LLMs to perform ADE classification on X data. We fine-tuned various LLMs including BERT-base, Bio_ClinicalBERT, RoBERTa, and RoBERTa-large. We also experimented ChatGPT few-shot prompting and ChatGPT fine-tuned on the whole training data. We then evaluated the model performance based on sensitivity, specificity, negative predictive value, positive predictive value, accuracy, F1-measure, and area under the ROC curve. Our results showed that RoBERTa-large achieved the best F1-measure (0.8) among all models followed by ChatGPT fine-tuned model with F1-measure of 0.75. Our feature importance analysis based on 1200 random samples and RoBERTa-Large showed the most important features are as follows: ""withdrawals""/""withdrawal"", ""dry"", ""dealing"", ""mouth"", and ""paralysis"". The good model performance and clinically relevant features show the potential of LLMs in augmenting ADE detection for post-marketing drug safety surveillance.",Deng Y; Xing Y; Quach J; Chen X; Wu X; Zhang Y; Moureaud C; Yu M; Zhao Y; Wang L; Zhong S
37256297,The Artificial Intelligence application in Aesthetic Medicine: How ChatGPT can Revolutionize the Aesthetic World.,2023,Aesthetic plastic surgery,,,,"Aesthetic medicine is witnessing a growing importance of ChatGPT and artificial intelligence (AI) technologies, as highlighted by the pioneering work of Xie et al.
in their article, ""Aesthetic Surgery Advice and Counseling from Artificial Intelligence: A Rhinoplasty Consultation with ChatGPT."" These advancements promise to revolutionize patient consultations, treatment planning, and follow-up care. AI-driven chatbots, such as ChatGPT, can enhance patient consultations by providing accurate and reliable information on aesthetic procedures, their risks, benefits, and potential outcomes, enabling well-informed decisions and improved treatment outcomes. Furthermore, AI can personalize treatment plans by analyzing patient data, leading to increased precision and satisfaction. AI-powered platforms can also streamline patient follow-up and monitoring, improving patient outcomes and resource utilization, while serving as a valuable educational tool for clinicians. Despite these benefits, AI integration in aesthetic medicine raises concerns about data privacy, security, and potential biases in AI algorithms. To address these challenges, the aesthetic medicine community must establish ethical guidelines, adopt stringent security protocols, and ensure diverse and representative datasets for AI training. Additionally, maintaining the personal connection between patients and providers is crucial for preserving the human touch in patient care. Level of Evidence V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors https://www.springer.com/00266.",Buzzaccarini G; Degliuomini RS; Borin M
39377065,Large language model to multimodal large language model: A journey to shape the biological macromolecules to biological sciences and medicine.,2024,Molecular therapy. Nucleic acids,,,,"After ChatGPT was released, large language models (LLMs) became more popular.
Academicians use ChatGPT or LLM models for different purposes, and the use of ChatGPT or LLM is increasing from medical science to diversified areas. Recently, the multimodal LLM (MLLM) has also become popular. Therefore, we comprehensively illustrate the LLM and MLLM models for a complete understanding. We also aim for simple and extended reviews of LLMs and MLLMs for a broad category of readers, such as researchers, students in diversified fields, and other academicians. The review article illustrates the LLM and MLLM models, their working principles, and their applications in diversified fields. First, we demonstrate the technical concept of LLMs, working principle, Black Box, and the evolution of LLMs. To explain the working principle, we discuss the tokenization process, token representation, and token relationships. We also extensively demonstrate the application of LLMs in biological macromolecules, medical science, biological science, and other areas. We illustrate the multimodal applications of LLMs or MLLMs. Finally, we illustrate the limitations, challenges, and future prospects of LLMs. The review acts as a booster dose for clinicians, a primer for molecular biologists, and a catalyst for scientists, and also benefits diversified academicians.",Bhattacharya M; Pal S; Chatterjee S; Lee SS; Chakraborty C
39663218,"Text-to-Video Models and Sora in Plastic Surgery: Pearls, Pitfalls, and Prospectives.",2024,Aesthetic plastic surgery,,,,"After the groundbreaking release of the highly acclaimed chatbot ChatGPT, which revolutionized the field of artificial intelligence (AI) last year, OpenAI has once again astounded the world with the unveiling of their latest generative AI model, Sora, on February 16, 2024. This cutting-edge model has the remarkable ability to generate videos up to a duration of 60 seconds solely through text instructions.
With a series of AI-generated contents, such as AI chat, AI drawing, and AI music, emerging one after another, the era of ""AI revolution"" that had a disruptive impact on modern life has arrived. Meanwhile, AI has made significant achievements in the medical field, especially in diagnosing based on medical imaging. This article briefly describes the development history of text-to-video models and provides a detailed introduction to the Sora model, including its portrayal of the human face and contour, inspiring its potential applications in plastic surgery. It also provides a prospect for other AI-generated content technologies, such as text-to-holography and text-to-material objects. Level of Evidence V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.",Kang Y; Wang S; Zhu L
39583462,Assessing the Performance of ChatGPT in Answering Patients' Questions Regarding Congenital Bicuspid Aortic Valve.,2024,Cureus,,,,"AIM: Artificial intelligence (AI) models, such as ChatGPT, are widely being used in academia as well as by the common public. In the field of medicine, the information obtained by the professionals as well as by the patients from the AI tools has significant advantages while at the same time posing valid concerns regarding the validity and adequacy of information regarding healthcare delivery and utilization. Therefore, it is important to vet these AI tools through the prism of practicing physicians.
METHODS: To demonstrate the immense utility as well as potential concerns of using ChatGPT to gather medical information, a set of questions were posed to the chatbot regarding a hypothetical patient with a congenital bicuspid aortic valve (BAV), and the answers were recorded and reviewed based on three criteria: (i) readability/technicality; (ii) adequacy/completeness; and (iii) accuracy/authenticity. RESULTS: While the ChatGPT provided detailed information about clinical pictures, treatment, and outcomes regarding BAV, the information was generic and brief, and the utility was limited due to a lack of specific information based on an individual patient's clinical status. The authenticity of the information could not be verified due to a lack of citations. Further, human aspects that would normally emerge in nuanced doctor-patient communication were missing in the ChatGPT output. CONCLUSION: Although the performance of AI in medical care is expected to grow, imperfections and ethical concerns may remain a huge challenge in utilizing information from the chatbots alone without adequate communications with health providers, despite having numerous advantages of this technology to society in many walks of human life.",Barua M
39239689,Humans-written versus ChatGPT-generated case reports.,2024,The journal of obstetrics and gynaecology research,,,,"AIM: Artificial intelligence, especially ChatGPT, has been used in various aspects of medicine; however, whether ChatGPT can be used in case report writing is unknown. This study aimed to provoke discussion and provide a platform for it. METHODS: I wrote a theoretical case report where cyst aspiration cured a twisted ovarian cyst (Manuscript 4). I tasked ChatGPT with generating case reports by inputting information at three different levels: (1) key message and case profile, (2) addition of key introduction information (including known facts and problems to be solved), and (3) further addition of main discussion points.
These inputs resulted in the creation of Manuscripts 1-3, which were subjected to analysis. Manuscript 3, generated by ChatGPT with the deepest information input, was compared with Manuscript 4, the human-authored counterpart. RESULTS: With the least information, Manuscript 1 can stand on its own, but its content is superficial. The more detailed data input, the more readable and reasonable the manuscripts become. A human-written manuscript involves personal experience and viewpoints other than obstetrics-gynecology. CONCLUSIONS: Better input produced more reasonable and readable case reports. Human-written paper, compared with ChatGPT-generated one, can involve ""human touch."" Whether such human touch enriches the case report awaits further discussion. Whether ChatGPT can be used in case report writing, and if it can, to what extent, should be worthy of further study. I encourage every doctor to form their own stance towards ChatGPT use in medical writing.",Matsubara S
37967485,Performance of Google bard and ChatGPT in mass casualty incidents triage.,2024,The American journal of emergency medicine,,,,"AIM: The objective of our research is to evaluate and compare the performance of ChatGPT, Google Bard, and medical students in performing START triage during mass casualty situations. METHOD: We conducted a cross-sectional analysis to compare ChatGPT, Google Bard, and medical students in mass casualty incident (MCI) triage using the Simple Triage And Rapid Treatment (START) method. A validated questionnaire with 15 diverse MCI scenarios was used to assess triage accuracy and content analysis in four categories: ""Walking wounded,"" ""Respiration,"" ""Perfusion,"" and ""Mental Status."" Statistical analysis compared the results. RESULT: Google Bard demonstrated a notably higher accuracy of 60%, while ChatGPT achieved an accuracy of 26.67% (p = 0.002). Comparatively, medical students performed at an accuracy rate of 64.3% in a previous study.
However, there was no significant difference observed between Google Bard and medical students (p = 0.211). Qualitative content analysis of 'walking-wounded', 'respiration', 'perfusion', and 'mental status' indicated that Google Bard outperformed ChatGPT. CONCLUSION: Google Bard was found to be superior to ChatGPT in correctly performing mass casualty incident triage. Google Bard achieved an accuracy of 60%, while ChatGPT only achieved an accuracy of 26.67%. This difference was statistically significant (p = 0.002).",Gan RK; Ogbodo JC; Wee YZ; Gan AZ; Gonzalez PA
38478902,ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment.,2025,Medical teacher,,,,"AIM: This study aimed to evaluate the real-life performance of clinical vignettes and multiple-choice questions generated by using ChatGPT. METHODS: This was a randomized controlled study in an evidence-based medicine training program. We randomly assigned seventy-four medical students to two groups. The ChatGPT group received ill-defined cases generated by ChatGPT, while the control group received human-written cases. At the end of the training, they evaluated the cases by rating 10 statements using a Likert scale. They also answered 15 multiple-choice questions (MCQs) generated by ChatGPT. The case evaluations of the two groups were compared. Some psychometric characteristics (item difficulty and point-biserial correlations) of the test were also reported. RESULTS: None of the scores in 10 statements regarding the cases showed a significant difference between the ChatGPT group and the control group (p > .05). In the test, only six MCQs had acceptable levels (higher than 0.30) of point-biserial correlation, and five items could be considered acceptable in classroom settings.
CONCLUSIONS: The results showed that the quality of the vignettes is comparable to those created by human authors, and some multiple-choice questions have acceptable psychometric characteristics. ChatGPT has potential in generating clinical vignettes for teaching and MCQs for assessment in medical education.",Coskun O; Kiyak YS; Budakoglu II
38420596,Using generative artificial intelligence in bibliometric analysis: 10 years of research trends from the European Resuscitation Congresses.,2024,Resuscitation plus,,,,"AIMS: The aim of this study is to use generative artificial intelligence to perform bibliometric analysis on abstracts published at European Resuscitation Council (ERC) annual scientific congress and define trends in ERC guidelines topics over the last decade. METHODS: In this bibliometric analysis, the WebHarvy software (SysNucleus, India) was used to download data from the Resuscitation journal's website through the technique of web scraping. Next, the Chat Generative Pre-trained Transformer 4 (ChatGPT-4) application programming interface (Open AI, USA) was used to implement the multinomial classification of abstract titles following the ERC 2021 guidelines topics. RESULTS: From 2012 to 2022 a total of 2491 abstracts have been published at ERC congresses. Published abstracts ranged from 88 (in 2020) to 368 (in 2015). On average, the most common ERC guidelines topics were Adult basic life support (50.1%), followed by Adult advanced life support (41.5%), while Newborn resuscitation and support of transition of infants at birth (2.1%) was the least common topic. The findings also highlight that the Basic Life Support and Adult Advanced Life Support ERC guidelines topics have the strongest co-occurrence to all ERC guidelines topics, where the Newborn resuscitation and support of transition of infants at birth (2.1%; 52/2491) ERC guidelines topic has the weakest co-occurrence.
CONCLUSION: Using large language models, this study demonstrates the capabilities of generative artificial intelligence in the bibliometric analysis of abstract titles, taking resuscitation medicine research presented at ERC conferences over the last decade as an example.",Fijacko N; Creber RM; Abella BS; Kocbek P; Metlicar S; Greif R; Stiglic G 39191671,Performance of the ChatGPT large language model for decision support in community pharmacy.,2024,British journal of clinical pharmacology,,,,"AIMS: The aim of this study was to assess the ChatGPT-4 (ChatGPT) large language model (LLM) on tasks relevant to community pharmacy. METHODS: ChatGPT was assessed with community pharmacy-relevant test cases involving drug information retrieval, identifying labelling errors, prescription interpretation, decision-making under uncertainty and multidisciplinary consults. Drug information on rituximab, warfarin, and St. John's wort was queried. The decision-support scenarios consisted of a subject with swollen eyelids and a subject on lisinopril and ferrous sulfate with a maculopapular rash. The multidisciplinary scenarios required the integration of medication management with recommendations for healthy eating and physical activity/exercise. RESULTS: The responses from ChatGPT for rituximab, warfarin, and St. John's wort were satisfactory and cited drug databases and drug-specific monographs. ChatGPT identified labeling errors related to incorrect medication strength, form, route of administration, unit conversion, and directions. For the patient with inflamed eyelids, the course of action developed by ChatGPT was comparable to the pharmacist's approach. For the patient with the maculopapular rash, both the pharmacist and ChatGPT placed a drug reaction to either lisinopril or ferrous sulfate at the top of the differential. ChatGPT provided customized vaccination requirements for travel to Brazil, guidance on management of drug allergies and recovery from a knee injury.
ChatGPT provided satisfactory medication management and wellness information for a diabetic on metformin and semaglutide. CONCLUSIONS: LLMs have the potential to become a powerful tool in community pharmacy. However, rigorous validation studies across diverse pharmacist queries, drug classes and populations, and engineering to secure patient privacy will be needed to enhance LLM utility.",Shin E; Hartman M; Ramanathan M 40066678,Assessing GPT-4's accuracy in answering clinical pharmacological questions on pain therapy.,2025,British journal of clinical pharmacology,,,,"AIMS: This study aimed to evaluate the accuracy and completeness of GPT-4, a large language model, in answering clinical pharmacological questions related to pain therapy, with a focus on its potential as a tool for delivering patient-facing medical information. The objective was to assess its reliability in delivering medical information in the context of pain management. METHODS: A cross-sectional survey-based study was conducted with healthcare professionals, including physicians and pharmacists. Participants submitted up to 8 clinical pharmacology questions on pain management, focusing on drug interactions, dosages and contraindications. GPT-4's responses were evaluated based on comprehensibility, detail, satisfaction, medical-pharmacological accuracy and completeness. Additionally, responses were compared to the German Drug Directory to assess their accuracy. RESULTS: The majority of participants (99%) found GPT-4's responses comprehensible, while 84% considered the information detailed enough. Overall satisfaction was high, with 93% expressing satisfaction, and 96% deemed the responses medically accurate. However, only 63% rated the information as complete, with some identifying gaps in pharmacokinetics and drug interaction data. Usability was evaluated as good to excellent, with a System Usability Scale score of 83.38 (+/- 10.26). 
CONCLUSION: GPT-4 demonstrates potential as a tool for delivering medical information, particularly in pain management. However, limitations such as incomplete pharmacological data and the potential for contextual carryover in follow-up questions suggest that further refinement is necessary. Developing specialized artificial intelligence tools that integrate real-time pharmacological databases could improve accuracy and reliability for clinical decision-making.",Stroop A; Stroop T; Zawy Alsofy S; Wegner M; Nakamura M; Stroop R 38953544,Evaluation of artificial intelligence-generated drug therapy communication skill competencies in medical education.,2024,British journal of clinical pharmacology,,,,"AIMS: This study compared three artificial intelligence (AI) platforms' potential to identify drug therapy communication competencies expected of a graduating medical doctor. METHODS: We presented three AI platforms, namely, Poe Assistant(c), ChatGPT(c) and Google Bard(c), with structured queries to generate communication skill competencies and case scenarios appropriate for graduating medical doctors. These case scenarios comprised 15 prototypical medical conditions that required drug prescriptions. Two authors independently evaluated the AI-enhanced clinical encounters, which integrated a diverse range of information to create patient-centred care plans. Through a consensus-based approach using a checklist, the communication components generated for each scenario were assessed. The instructions and warnings provided for each case scenario were evaluated by referencing the British National Formulary. RESULTS: AI platforms demonstrated overlap in competency domains generated, albeit with variations in wording. The domains of knowledge (basic and clinical pharmacology, prescribing, communication and drug safety) were unanimously recognized by all platforms. 
A broad consensus between Poe Assistant(c) and ChatGPT(c) on drug therapy-related communication issues specific to each case scenario was evident. The consensus primarily encompassed salutation, generic drug prescribed, treatment goals and follow-up schedules. Differences were observed in patient instruction clarity, listed side effects, warnings and patient empowerment. Google Bard did not provide guidance on patient communication issues. CONCLUSIONS: AI platforms recognized competencies with variations in how these were stated. Poe Assistant(c) and ChatGPT(c) exhibited alignment of communication issues. However, significant discrepancies were observed in specific skill components, indicating the necessity of human intervention to critically evaluate AI-generated outputs.",Sridharan K; Sequeira RP 37626010,Evaluating the performance of ChatGPT in clinical pharmacy: A comparative study of ChatGPT and clinical pharmacists.,2024,British journal of clinical pharmacology,,,,"AIMS: To evaluate the performance of chat generative pretrained transformer (ChatGPT) in key domains of clinical pharmacy practice, including prescription review, patient medication education, adverse drug reaction (ADR) recognition, ADR causality assessment and drug counselling. METHODS: Questions and clinical pharmacist's answers were collected from real clinical cases and clinical pharmacist competency assessment. ChatGPT's responses were generated by inputting the same question into the 'New Chat' box of ChatGPT Mar 23 Version. Five licensed clinical pharmacists independently rated these answers on a scale of 0 (Completely incorrect) to 10 (Completely correct). The mean scores of ChatGPT and clinical pharmacists were compared using a paired 2-tailed Student's t-test. The text content of the answers was also descriptively summarized together. RESULTS: The quantitative results indicated that ChatGPT was excellent in drug counselling (ChatGPT: 8.77 vs.
clinical pharmacist: 9.50, P = .0791) and weak in prescription review (5.23 vs. 9.90, P = .0089), patient medication education (6.20 vs. 9.07, P = .0032), ADR recognition (5.07 vs. 9.70, P = .0483) and ADR causality assessment (4.03 vs. 9.73, P = .023). The capabilities and limitations of ChatGPT in clinical pharmacy practice were summarized based on the completeness and accuracy of the answers. ChatGPT revealed robust retrieval, information integration and dialogue capabilities. It lacked medicine-specific datasets as well as the ability to handle advanced reasoning and complex instructions. CONCLUSIONS: While ChatGPT holds promise in clinical pharmacy practice as a supplementary tool, the ability of ChatGPT to handle complex problems needs further improvement and refinement.",Huang X; Estau D; Liu X; Yu Y; Qin J; Li Z 37123797,A Case Report on Ground-Level Alternobaric Vertigo Due to Eustachian Tube Dysfunction With the Assistance of Conversational Generative Pre-trained Transformer (ChatGPT).,2023,Cureus,,,,"Alternobaric vertigo (ABV) develops when the middle ear pressure (MEP) is not equal at the same height in the sea or the air. This is possible when the altitude changes. Eustachian tube dysfunction (ETD) is a common cause of ABV. In this case report, we discuss a patient who experienced repeated bouts of ground-level alternobaric vertigo (GLABV) due to ETD. We also discuss how Conversational Generative Pre-trained Transformer (ChatGPT) might be used in the creation of this case report. A 41-year-old male patient complained of vertigo at ground level on several occasions. His medical history included chronic sinusitis, nasal congestion, and laryngopharyngeal reflux (LPR). During the physical exam, his tympanic membranes were dull and moved less. Tympanometry showed that he had an asymmetric type A and that both of his middle ears had negative pressure. The results of the audiometry test were normal, and the laryngoscopy revealed LPR.
The patient was found to have GLABV because of ETD, and different treatment options, such as Eustachian tube catheterization (ETC), were considered. This case study demonstrates how ChatGPT can be used to assist with medical documentation and the treatment of GLABV caused by ETD. Even though ChatGPT did not provide specific diagnostic or treatment recommendations for the patient's condition, it did assist the doctor in determining what was wrong and how to treat it while writing the case report. It also aided the doctor in writing the case report by allowing them to discuss it. The use of artificial intelligence (AI) tools such as ChatGPT has the potential to improve the accuracy and speed of medical documentation, thereby streamlining clinical workflows and improving patient care. Nonetheless, it is critical to consider the ethical implications of using AI in clinical practice. This case study emphasizes the importance of understanding that ETD is a common cause of GLABV and how ChatGPT can aid in the diagnosis and treatment of this condition. More research is needed to fully understand how long-term AI interventions in medicine work and how reliable they are.",Kim HY 37179277,Artificial Intelligence in Intensive Care Medicine: Toward a ChatGPT/GPT-4 Way?,2023,Annals of biomedical engineering,,,,"Although intensive care medicine (ICM) is a relatively young discipline, it has rapidly developed into a full-fledged and highly specialized specialty covering several fields of medicine. The COVID-19 pandemic led to a surge in intensive care unit demand and also brought unprecedented development opportunities for this area. Multiple new technologies such as artificial intelligence (AI) and machine learning (ML) were gradually being applied in this field.
In this study, through an online survey, we have summarized the potential uses of ChatGPT/GPT-4 in ICM, including knowledge augmentation, device management, clinical decision-making support, early warning systems, and the establishment of an intensive care unit (ICU) database.",Lu Y; Wu H; Qi S; Cheng K 38056130,Assessing the potential of ChatGPT for psychodynamic formulations in psychiatry: An exploratory study.,2024,Psychiatry research,,,,"Although there have been several attempts to apply ChatGPT (Generative Pre-Trained Transformer) to medicine, little is known about therapeutic applications in psychiatry. In this exploratory study, we aimed to evaluate the characteristics and appropriateness of the psychodynamic formulations created by ChatGPT. Along with a case selected from the psychoanalytic literature, input prompts were designed to include different levels of background knowledge. These included naive prompts, keywords created by ChatGPT, keywords created by psychiatrists, and psychodynamic concepts from the literature. The psychodynamic formulations generated from the different prompts were evaluated by five psychiatrists from different institutions. We next conducted further tests in which instructions on the use of different psychodynamic models were added to the input prompts. The models used were ego psychology, self-psychology, and object relations. The results from naive prompts and psychodynamic concepts were rated as appropriate by most raters. The psychodynamic concept prompt output was rated the highest. Interrater agreement was statistically significant. The results from the tests using instructions in different psychoanalytic theories were also rated as appropriate by most raters. They included key elements of the psychodynamic formulation and suggested interpretations similar to the literature.
These findings suggest the potential of ChatGPT for use in psychiatry.",Hwang G; Lee DY; Seol S; Jung J; Choi Y; Her ES; An MH; Park RW 39011990,Large Language Model-Based Natural Language Encoding Could Be All You Need for Drug Biomedical Association Prediction.,2024,Analytical chemistry,,,,"Analyzing drug-related interactions in the field of biomedicine has been a critical aspect of drug discovery and development. While various artificial intelligence (AI)-based tools have been proposed to analyze drug biomedical associations (DBAs), their feature encoding did not adequately account for crucial biomedical functions and semantic concepts, thereby still hindering their progress. Since the advent of ChatGPT by OpenAI in 2022, large language models (LLMs) have demonstrated rapid growth and significant success across various applications. Herein, LEDAP was introduced, which uniquely leveraged LLM-based biotext feature encoding for predicting drug-disease associations, drug-drug interactions, and drug-side effect associations. Benefiting from the large-scale knowledgebase pre-training, LLMs had great potential in drug development analysis owing to their holistic understanding of natural language and human topics. LEDAP illustrated its notable competitiveness in comparison with other popular DBA analysis tools. Specifically, even in simple conjunction with classical machine learning methods, LLM-based feature representations consistently enabled satisfactory performance across diverse DBA tasks like binary classification, multiclass classification, and regression.
Our findings underscore the considerable potential of LLMs in drug development research, serving as a catalyst for further progress in related fields.",Zhang H; Zhou Y; Zhang Z; Sun H; Pan Z; Mou M; Zhang W; Ye Q; Hou T; Li H; Hsieh CY; Zhu F 37568980,The Utility of Language Models in Cardiology: A Narrative Review of the Benefits and Concerns of ChatGPT-4.,2023,International journal of environmental research and public health,,,,"Artificial intelligence (AI) and language models such as ChatGPT-4 (Generative Pretrained Transformer) have made tremendous advances recently and are rapidly transforming the landscape of medicine. Cardiology is among the many specialties that utilize AI with the intention of improving patient care. Generative AI, with the use of its advanced machine learning algorithms, has the potential to diagnose heart disease and recommend management options suitable for the patient. This may lead to improved patient outcomes not only by recommending the best treatment plan but also by increasing physician efficiency. Language models could assist physicians with administrative tasks, allowing them to spend more time on patient care. However, there are several concerns with the use of AI and language models in the field of medicine. These technologies may not be the most up-to-date with the latest research and could provide outdated information, which may lead to an adverse event. Secondly, AI tools can be expensive, leading to increased healthcare costs and reduced accessibility to the general population. There is also concern about the loss of the human touch and empathy as AI becomes more mainstream. Healthcare professionals would need to be adequately trained to utilize these tools. While AI and language models have many beneficial traits, all healthcare providers need to be involved and aware of generative AI so as to ensure its optimal use and mitigate any potential risks and challenges associated with its implementation.
In this review, we discuss the various uses of language models in the field of cardiology.",Gala D; Makaryus AN 37601525,"Artificial intelligence in medicine and research - the good, the bad, and the ugly.",2023,Saudi journal of anaesthesia,,,,"Artificial intelligence (AI) broadly refers to machines that simulate intelligent human behavior, and research into this field is exponential and worldwide, with global players such as Microsoft battling with Google for supremacy and market share. This paper reviews the ""good"" aspects of AI in medicine for individuals who embrace the 4P model of medicine (Predictive, Preventive, Personalized, and Participatory) to medical assistants in diagnostics, surgery, and research. The ""bad"" aspects relate to the potential for errors, culpability, ethics, data loss and data breaches, and so on. The ""ugly"" aspects are deliberate personal malfeasances and outright scientific misconduct including the ease of plagiarism and fabrication, with particular reference to the novel ChatGPT as well as AI software that can also fabricate graphs and images. The issues pertaining to the potential dangers of creating rogue, super-intelligent AI systems that lead to a technological singularity and the ensuing perceived existential threat to mankind by leading AI researchers are also briefly discussed.",Grech V; Cuschieri S; Eldawlatly AA 38477743,The performance of artificial intelligence chatbot large language models to address skeletal biology and bone health queries.,2024,Journal of bone and mineral research : the official journal of the American Society for Bone and Mineral Research,,,,"Artificial intelligence (AI) chatbots utilizing large language models (LLMs) have recently garnered significant interest due to their ability to generate humanlike responses to user inquiries in an interactive dialog format. 
While these models are being increasingly utilized to obtain medical information by patients, scientific and medical providers, and trainees to address biomedical questions, their performance may vary from field to field. The opportunities and risks these chatbots pose to the widespread understanding of skeletal health and science are unknown. Here we assess the performance of 3 high-profile LLM chatbots, Chat Generative Pre-Trained Transformer (ChatGPT) 4.0, BingAI, and Bard, to address 30 questions in 3 categories: basic and translational skeletal biology, clinical practitioner management of skeletal disorders, and patient queries to assess the accuracy and quality of the responses. Thirty questions in each of these categories were posed, and responses were independently graded for their degree of accuracy by four reviewers. While each of the chatbots was often able to provide relevant information about skeletal disorders, the quality and relevance of these responses varied widely, and ChatGPT 4.0 had the highest overall median score in each of the categories. Each of these chatbots displayed distinct limitations that included inconsistent, incomplete, or irrelevant responses, inappropriate utilization of lay sources in a professional context, a failure to take patient demographics or clinical context into account when providing recommendations, and an inability to consistently identify areas of uncertainty in the relevant literature. 
Careful consideration of both the opportunities and risks of current AI chatbots is needed to formulate guidelines for best practices for their use as a source of information about skeletal health and biology.",Cung M; Sosa B; Yang HS; McDonald MM; Matthews BG; Vlug AG; Imel EA; Wein MN; Stein EM; Greenblatt MB 39690824,The use of ChatGPT in the dermatological field: a narrative review.,2025,Clinical and experimental dermatology,,,,"Artificial intelligence (AI) encompasses the development of computer systems capable of tasks typically requiring human intelligence, such as visual perception, speech recognition, decision-making and language translation. Over time, numerous applications have emerged, with the integration of AI into medicine marking a significant leap forward in healthcare delivery, diagnosis and treatment. Among medical specialties, dermatology stands at the forefront of AI advancements, leveraging machine learning and deep learning to enhance dermatologists' abilities and improve patient care. ChatGPT is an advanced language model by OpenAI, originally designed for conversations, which has expanded its utility into diverse fields, including healthcare and dermatology. In this context, the aim of this review article was to explore the synergistic relationship between ChatGPT and dermatology, examining how this innovative AI model is reshaping skin health management, its potential applications, preliminary data on its efficiency and accuracy, as well as ethical and legal concerns related to the use of this tool.",Potestio L; Feo F; Martora F; Megna M; Napolitano M; D'Agostino M 38435218,Ocular Pathology and Genetics: Transformative Role of Artificial Intelligence (AI) in Anterior Segment Diseases.,2024,Cureus,,,,"Artificial intelligence (AI) has become a revolutionary influence in the field of ophthalmology, providing unparalleled capabilities in data analysis and pattern recognition.
This narrative review delves into the crucial role that AI plays, particularly in the context of anterior segment diseases with a genetic basis. Corneal dystrophies (CDs) exhibit significant genetic diversity, manifested by irregular substance deposition in the cornea. AI-driven diagnostic tools exhibit promising accuracy in the identification and classification of corneal diseases. Importantly, chat generative pre-trained transformer (ChatGPT)-4.0 shows significant advancement over its predecessor, ChatGPT-3.5. In the realm of glaucoma, AI significantly contributes to precise diagnostics through inventive algorithms and machine learning models, surpassing conventional methods. The incorporation of AI in predicting glaucoma progression and its role in augmenting diagnostic efficiency is readily apparent. Additionally, AI-powered models prove beneficial for early identification and risk assessment in cases of congenital cataracts, characterized by diverse inheritance patterns. Machine learning models achieving exceptional discrimination in identifying congenital cataracts underscore AI's remarkable potential. The review concludes by emphasizing the promising implications of AI in managing anterior segment diseases, spanning from early detection to the tailoring of personalized treatment strategies. These advancements signal a paradigm shift in ophthalmic care, offering optimism for enhanced patient outcomes and more streamlined healthcare delivery.",Venkatapathappa P; Sultana A; K S V; Mansour R; Chikkanarayanappa V; Rangareddy H 37500980,AI-ChatGPT/GPT-4: An Booster for the Development of Physical Medicine and Rehabilitation in the New Era!,2024,Annals of biomedical engineering,,,,"Artificial intelligence (AI) has been driving the continuous development of the Physical Medicine and Rehabilitation (PM&R) fields. The latest release of ChatGPT/GPT-4 has shown us that AI can potentially transform the healthcare industry. 
In this study, we propose various ways in which ChatGPT/GPT-4 can display its talents in the field of PM&R in the future. ChatGPT/GPT-4 is an essential tool for Physiatrists in the new era.",Peng S; Wang D; Liang Y; Xiao W; Zhang Y; Liu L 38201418,"A Systematic Review and Meta-Analysis of Artificial Intelligence Tools in Medicine and Healthcare: Applications, Considerations, Limitations, Motivation and Challenges.",2024,"Diagnostics (Basel, Switzerland)",,,,"Artificial intelligence (AI) has emerged as a transformative force in various sectors, including medicine and healthcare. Large language models like ChatGPT showcase AI's potential by generating human-like text through prompts. ChatGPT's adaptability holds promise for reshaping medical practices, improving patient care, and enhancing interactions among healthcare professionals, patients, and data. In pandemic management, ChatGPT rapidly disseminates vital information. It serves as a virtual assistant in surgical consultations, aids dental practices, simplifies medical education, and aids in disease diagnosis. A total of 82 papers were categorised into eight major areas, which are G1: treatment and medicine, G2: buildings and equipment, G3: parts of the human body and areas of the disease, G4: patients, G5: citizens, G6: cellular imaging, radiology, pulse and medical images, G7: doctors and nurses, and G8: tools, devices and administration. Balancing AI's role with human judgment remains a challenge. A systematic literature review using the PRISMA approach explored AI's transformative potential in healthcare, highlighting ChatGPT's versatile applications, limitations, motivation, and challenges. In conclusion, ChatGPT's diverse medical applications demonstrate its potential for innovation, serving as a valuable resource for students, academics, and researchers in healthcare.
Additionally, this study serves as a guide, assisting students, academics, and researchers in the field of medicine and healthcare alike.",Younis HA; Eisa TAE; Nasser M; Sahib TM; Noor AA; Alyasiri OM; Salisu S; Hayder IM; Younis HA 38827766,Application of ChatGPT for Orthopedic Surgeries and Patient Care.,2024,Clinics in orthopedic surgery,,,,"Artificial intelligence (AI) has rapidly transformed various aspects of life, and the launch of the chatbot ""ChatGPT"" by OpenAI in November 2022 has garnered significant attention and user appreciation. ChatGPT utilizes natural language processing based on a ""generative pre-trained transfer"" (GPT) model, specifically the transformer architecture, to generate human-like responses to a wide range of questions and topics. Equipped with approximately 57 billion words and 175 billion parameters from online data, ChatGPT has potential applications in medicine and orthopedics. One of its key strengths is its personalized, easy-to-understand, and adaptive response, which allows it to learn continuously through user interaction. This article discusses how AI, especially ChatGPT, presents numerous opportunities in orthopedics, ranging from preoperative planning and surgical techniques to patient education and medical support. Although ChatGPT's user-friendly responses and adaptive capabilities are laudable, its limitations, including biased responses and ethical concerns, necessitate its cautious and responsible use. Surgeons and healthcare providers should leverage the strengths of the ChatGPT while recognizing its current limitations and verifying critical information through independent research and expert opinions. As AI technology continues to evolve, ChatGPT may become a valuable tool in orthopedic education and patient care, leading to improved outcomes and efficiency in healthcare delivery. 
The integration of AI into orthopedics offers substantial benefits but requires careful consideration and continuous improvement.",Morya VK; Lee HW; Shahid H; Magar AG; Lee JH; Kim JH; Jun L; Noh KC 37375838,"The Role of AI in Drug Discovery: Challenges, Opportunities, and Strategies.",2023,"Pharmaceuticals (Basel, Switzerland)",,,,"Artificial intelligence (AI) has the potential to revolutionize the drug discovery process, offering improved efficiency, accuracy, and speed. However, the successful application of AI is dependent on the availability of high-quality data, the addressing of ethical concerns, and the recognition of the limitations of AI-based approaches. In this article, the benefits, challenges, and drawbacks of AI in this field are reviewed, and possible strategies and approaches for overcoming the present obstacles are proposed. The use of data augmentation, explainable AI, and the integration of AI with traditional experimental methods, as well as the potential advantages of AI in pharmaceutical research, are also discussed. Overall, this review highlights the potential of AI in drug discovery and provides insights into the challenges and opportunities for realizing its potential in this field. Note from the human authors: This article was created to test the ability of ChatGPT, a chatbot based on the GPT-3.5 language model, in terms of assisting human authors in writing review articles. The text generated by the AI following our instructions (see Supporting Information) was used as a starting point, and its ability to automatically generate content was evaluated. After conducting a thorough review, the human authors practically rewrote the manuscript, striving to maintain a balance between the original proposal and the scientific criteria. 
The advantages and limitations of using AI for this purpose are discussed in the last section.",Blanco-Gonzalez A; Cabezon A; Seco-Gonzalez A; Conde-Torres D; Antelo-Riveiro P; Pineiro A; Garcia-Fandino R 37809168,"A Call to Address AI ""Hallucinations"" and How Healthcare Professionals Can Mitigate Their Risks.",2023,Cureus,,,,"Artificial intelligence (AI) has transformed society in many ways. AI in medicine has the potential to improve medical care and reduce healthcare professional burnout but we must be cautious of a phenomenon termed ""AI hallucinations""and how this term can lead to the stigmatization of AI systems and persons who experience hallucinations. We believe the term ""AI misinformation"" to be more appropriate and avoids contributing to stigmatization. Healthcare professionals can play an important role in AI's integration into medicine, especially regarding mental health services, so it is important that we continue to critically evaluate AI systems as they emerge.",Hatem R; Simmons B; Thornton JE 39621019,A Survey of Veterinary Student Perceptions on Integrating ChatGPT in Veterinary Education Through AI-Driven Exercises.,2024,Journal of veterinary medical education,,,,"Artificial intelligence (AI) in education is rapidly gaining attention, particularly with tools like ChatGPT, which have the potential to transform learning experiences. However, the application of such tools in veterinary education remains underexplored. This study aimed to design an AI-driven exercise and investigate veterinary students' perceptions regarding the integration of ChatGPT into their education, specifically within the Year 5 Equine Medicine and Surgery course at City University of Hong Kong. Twenty-two veterinary students participated in an AI-driven exercise, where they created multiple-choice questions (MCQs) and evaluated ChatGPT's responses. The exercise was designed to promote active learning and a deeper understanding of complex concepts. 
The results indicate a generally positive reception, with 72.7% of students finding the exercise moderately to extremely engaging and 77.3% agreeing that it deepened their understanding. Additionally, 68.2% of students reported improvements in their critical thinking skills. Students with prior AI experience exhibited higher engagement levels and perceived the exercise as more effective. The study also found that engagement positively correlated with perceived usefulness, overall satisfaction, and the likelihood of recommending similar AI-driven exercises in other courses. Qualitative feedback underscored the interactive nature of this exercise and its usefulness in helping students understand complex concepts, although some students experienced confusion with AI-generated responses. While acknowledging the limitations of the technology and the small sample size, this study provides valuable insights into the potential benefits and challenges of incorporating AI-driven tools into veterinary education, highlighting the need for carefully considered integration of such tools into the curriculum.",Alonso Sousa S; Flay KJ 38721180,"Redefining Healthcare With Artificial Intelligence (AI): The Contributions of ChatGPT, Gemini, and Co-pilot.",2024,Cureus,,,,"Artificial Intelligence (AI) in healthcare marks a new era of innovation and efficiency, characterized by the emergence of sophisticated language models such as ChatGPT (OpenAI, San Francisco, CA, USA), Gemini Advanced (Google LLC, Mountain View, CA, USA), and Co-pilot (Microsoft Corp, Redmond, WA, USA). This review explores the transformative impact of these AI technologies on various facets of healthcare, from enhancing patient care and treatment protocols to revolutionizing medical research and tackling intricate health science challenges. ChatGPT, with its advanced natural language processing capabilities, leads the way in providing personalized mental health support and improving chronic condition management. 
Gemini Advanced extends the boundary of AI in healthcare through data analytics, facilitating early disease detection and supporting medical decision-making. Co-pilot, by integrating seamlessly with healthcare systems, optimizes clinical workflows and encourages a culture of innovation among healthcare professionals. Additionally, the review highlights the significant contributions of AI in accelerating medical research, particularly in genomics and drug discovery, thus paving the way for personalized medicine and more effective treatments. The pivotal role of AI in epidemiology, especially in managing infectious diseases such as COVID-19, is also emphasized, demonstrating its value in enhancing public health strategies. However, the integration of AI technologies in healthcare comes with challenges. Concerns about data privacy, security, and the need for comprehensive cybersecurity measures are discussed, along with the importance of regulatory compliance and transparent consent management to uphold ethical standards and patient autonomy. The review points out the necessity for seamless integration, interoperability, and the maintenance of AI systems' reliability and accuracy to fully leverage AI's potential in advancing healthcare.",Alhur A 37466157,The impact of Chat Generative Pre-trained Transformer (ChatGPT) on medical education.,2023,Postgraduate medical journal,,,,"Artificial intelligence (AI) in medicine is developing rapidly. The advent of Chat Generative Pre-trained Transformer (ChatGPT) has taken the world by storm with its potential uses and efficiencies. However, technology leaders, researchers, educators, and policy makers have also sounded the alarm on its potential harms and unintended consequences. AI will increasingly find its way into medicine and is a force of both disruption and innovation.
We discuss the potential benefits and limitations of this new league of technology and how medical educators must develop skills and curricula to best harness this innovative power.",Heng JJY; Teo DB; Tan LF 37094759,"Artificial Intelligence and new language models in Ophthalmology: Complications of the use of silicone oil in vitreoretinal surgery.",2023,Archivos de la Sociedad Espanola de Oftalmologia,,,,"Artificial intelligence (AI) is an emerging technology that facilitates everyday tasks and automates tasks in various fields such as medicine. However, the emergence of a language model in academia has generated a lot of interest. This paper evaluates the potential of ChatGPT, a language model developed by OpenAI, and DALL-E 2, an image generator, in the writing of scientific articles in ophthalmology. The selected topic is the complications of the use of silicone oil in vitreoretinal surgery. ChatGPT was used to generate an abstract and a structured article, suggestions for a title, and bibliographical references. In conclusion, despite the knowledge demonstrated by this tool, its scientific accuracy and reliability on specific topics are insufficient for the automatic generation of scientifically rigorous articles. In addition, scientists should be aware of the possible ethical and legal implications of these tools.",Valentin-Bravo FJ; Mateos-Alvarez E; Usategui-Martin R; Andres-Iglesias C; Pastor-Jimeno JC; Pastor-Idoate S 37692629,AI-Powered Chatbots in Medical Education: Potential Applications and Implications.,2023,Cureus,,,,"Artificial intelligence (AI) is anticipated to have a considerable impact on the routine practice of medicine, spanning from medical education to clinical practice across specialties and, ultimately, patient care.
With the imminent widespread adoption of AI in medical practice, it is imperative that medical schools adapt to the use of these advanced technologies in their curriculum to produce future healthcare professionals who can seamlessly integrate these tools into practice. Chatbots, AI systems programmed to process and generate human language, are currently being evaluated for various tasks in medical education. This paper explores the potential applications and implications of chatbots in medical education, specifically in learning and research. With their capability to summarize, simplify complex concepts, automate the creation of memory aids, and serve as an interactive tutor and point-of-care medical reference, chatbots have the potential to enhance students' comprehension, retention, and application of medical knowledge in real-time. While the integration of AI-powered chatbots in medical education presents numerous advantages, it is crucial for students to use these tools as assistive tools rather than relying on them entirely. Chatbots should be programmed to reference evidence-based medical resources and produce precise and trustworthy content that adheres to medical science standards, scientific writing guidelines, and ethical considerations.",Ghorashi N; Ismail A; Ghosh P; Sidawy A; Javan R 38011011,"""You Are Not Alone"": The Allure and Limitations of Artificial Intelligence in Serious Illness Communication.",2024,Journal of palliative medicine,,,,"Artificial intelligence (AI) is changing the way clinicians practice medicine, and recent technological advancements have resulted in consumer-facing products that can respond to users with dynamic and nuanced language. Clinicians typically struggle with serious illness communication, such as delivering news about a poor prognosis. Palliative care clinicians receive extensive training in serious illness communication, but there is a paucity of such highly trained specialists. 
This article explores the allure of employing AI-powered chatbots to assist nonspecialist clinicians with serious illness communication and highlights the ethical and practical drawbacks. While outsourcing communication to new AI chatbot technologies may be inappropriate, there is a role for AI in training clinicians on effective language to use when discussing serious illness with their patients.",Burry N; Nakagawa S; Blinderman CD 38472350,"Artificial Intelligence in Plastic Surgery: Analysis of Applications, Perspectives, and Psychological Impact.",2025,Aesthetic plastic surgery,,,,"Artificial intelligence (AI) is emerging as a promising tool in the field of plastic surgery, offering a wide array of applications that enhance surgical outcomes, patient satisfaction, and overall efficiency. This paper explores the utilization of AI, highlighting its various advantages and potential drawbacks. AI-driven technologies such as computer vision, machine learning algorithms, and robotic assistance facilitate preoperative planning, intraoperative guidance, and postoperative monitoring. These advancements enable precise anatomical measurements, personalized treatment plans, and real-time feedback during surgery, leading to improved accuracy and safety. Furthermore, AI-powered image analysis aids in facial recognition, skin texture assessment, and simulation of surgical outcomes, enabling enhanced patient consultations and predictive modeling. However, the integration of AI in plastic surgery also presents challenges, including ethical concerns, data privacy, algorithm biases, and the need for comprehensive training among healthcare professionals. Additionally, the reliance on AI systems may potentially lead to over-reliance or reduced surgeon autonomy, necessitating careful validation and continuous refinement of these technologies. 
Despite these challenges, the synergistic collaboration between AI and plastic surgery holds great promise in advancing clinical practice, fostering innovation, and ultimately benefiting patients through optimized esthetic and reconstructive outcomes. Level of Evidence V. This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors https://www.springer.com/00266.",Barone M; De Bernardis R; Persichetti P 39984315,[Artificial intelligence in healthcare: A survival guide for internists].,2025,La Revue de medecine interne,,,,"Artificial intelligence (AI) is experiencing considerable growth in medicine, driven by the explosion of available biomedical data and the emergence of new algorithmic architectures. Applications are rapidly multiplying, from diagnostic assistance to disease progression prediction, paving the way for more personalized medicine. The recent advent of large language models, such as ChatGPT, has particularly interested the medical community, thanks to their ease of use, but also raised questions about their reliability in medical contexts. This review presents the fundamental concepts of medical AI, specifically distinguishing traditional discriminative approaches from new generative models. We detail the different exploitable data sources and methodological pitfalls to avoid during the development of these tools. Finally, we address the practical and ethical implications of this technological revolution, emphasizing the importance of the medical community's appropriation of these tools.",Barba T; Robert M; Hot A 37493985,Artificial intelligence and ChatGPT in Orthopaedics and sports medicine.,2023,Journal of experimental orthopaedics,,,,"Artificial intelligence (AI) is looked upon nowadays as the potential major catalyst for the fourth industrial revolution.
In the last decade, AI use in Orthopaedics increased approximately tenfold. Artificial intelligence helps with tracking activities, evaluating diagnostic images, predicting injury risk, and several other uses. Chat Generative Pre-trained Transformer (ChatGPT), which is an AI chatbot, represents an extremely controversial topic in the academic community. The aim of this review article is to simplify the concept of AI and study the extent of AI use in the Orthopaedics and sports medicine literature. Additionally, the article will also evaluate the role of ChatGPT in scientific research and publications. Level of evidence: Level V, letter to review.",Fayed AM; Mansur NSB; de Carvalho KA; Behrens A; D'Hooghe P; de Cesar Netto C 37905264,Decoding Applications of Artificial Intelligence in Rheumatology.,2023,Cureus,,,,"Artificial intelligence (AI) is not a newcomer in medicine. It has been employed for image analysis, disease diagnosis, drug discovery, and improving overall patient care. ChatGPT (Chat Generative Pre-trained Transformer, Inc., Delaware) has renewed interest and enthusiasm in artificial intelligence. Algorithms, machine learning, deep learning, and data analysis are some of the complex terminologies often encountered when health professionals try to learn AI. In this article, we try to review the practical applications of artificial intelligence in vernacular language in the fields of medicine and rheumatology in particular.
From the standpoint of the everyday physician, we have endeavored to encapsulate the influence of AI on the cutting edge of medical practice and the potential revolutionary shift in the realm of rheumatology.",Chinnadurai S; Mahadevan S; Navaneethakrishnan B; Mamadapur M 40353268,Get the Artificial Intelligence (AI) Edge in Obstetrics and Gynaecology.,2025,Journal of obstetrics and gynaecology of India,,,,"Artificial intelligence (AI) is on the fast track so far as growth is concerned, moving from experiment to implementation in the field of medicine. AI should be used ethically and intelligently. The availability of large databases, advances in algorithmic theory, and improvements in computing have led to breakthroughs in AI applications in current medicine. Machine learning (ML), a subset of AI, allows computers to automatically detect patterns in large databases that can be used to make predictions. Is it a paradigm shift? In the field of Obstetrics and Gynaecology, AI is used in reproductive medicine for diagnosis and treatment with fertility outcomes, cancer treatment, USG-MRI image diagnosis, foetal echocardiography, cardiotocography (CTG), preterm labour prediction, and urogynaecology. ChatGPT can be helpful in medical writing, but there is always a challenge with respect to accuracy and reliability. AI can be used in research and experiments, thereby strengthening evidence-based clinical practice. More research is ongoing on personalized diagnosis, treatment, and remote medical expert team opinion. It does not replace the medical advice given by the clinician, but that should not deter clinicians from exploring more uses of AI.
Despite various challenges and limitations, the integration of AI in the medical field is bound to progress in the right direction for a better future.",Dalvi S 38323891,Exploring the Promise and Challenges of Artificial Intelligence in Biomedical Research and Clinical Practice.,2024,Journal of cardiovascular pharmacology,,,,"Artificial intelligence (AI) is poised to revolutionize how science, and biomedical research in particular, is done. With AI, problem-solving and complex tasks using massive data sets can be performed at a much higher rate and dimensionality level compared with humans. With the ability to handle huge data sets and self-learn, AI is already being exploited in drug design, drug repurposing, toxicology, and material identification. AI could also be used in both basic and clinical research in study design, defining outcomes, analyzing data, interpreting findings, and even identifying the most appropriate areas of investigation and funding sources. State-of-the-art AI-based large language models, such as ChatGPT and Perplexity, are positioned to change forever how science is communicated and how scientists interact with one another and their profession, including postpublication appraisal and critique. Like all revolutions, upheaval will follow and not all outcomes can be predicted, necessitating guardrails at the onset, especially to minimize the untoward impact of the many drawbacks of large language models, which include lack of confidentiality, risk of hallucinations, and propagation of mainstream albeit potentially mistaken opinions and perspectives. In this review, we highlight areas of biomedical research that are already being reshaped by AI and how AI is likely to affect it further in the near future.
We discuss the potential benefits of AI in biomedical research and address possible risks, some surrounding the creative process, that warrant further reflection.",Altara R; Basson CJ; Biondi-Zoccai G; Booz GW 39934059,Generative Artificial Intelligence in Academic Surgery: Ethical Implications and Transformative Potential.,2025,The Journal of surgical research,,,,"Artificial intelligence (AI) is rapidly being used in medicine due to its advanced capabilities in image and video recognition, clinical decision support, surgical education, and administrative task automation. Large language models such as OpenAI's Generative Pretrained Transformer (GPT)-4 and Google's Bard have particularly revolutionized text generation, offering substantial benefits for the academic surgeon, including aiding in manuscript and grant writing. However, integrating AI into academic surgery necessitates addressing ethical concerns such as bias, transparency, and intellectual property. This paper provides guidelines and recommendations based on current literature around the opportunities and ethical challenges of AI in academic surgery. We discuss the underlying mechanisms of large language models, their potential biases, and the importance of responsible usage. Furthermore, we explore the ethical implications of AI in clinical documentation, highlighting improved efficiency and necessary privacy concerns. This review also addresses the critical issue of intellectual property dilemmas posed by AI-generated innovations in university settings. 
Finally, we propose guidelines for the responsible adoption of AI in academic and clinical environments, stressing the need for transparency, ethical training, and robust governance frameworks to ensure AI enhances, rather than undermines, academic integrity and patient care.",Robinson JR; Stey A; Schneider DF; Kothari AN; Lindeman B; Kaafarani HM; Haines KL 38674356,Innovations in Medicine: Exploring ChatGPT's Impact on Rare Disorder Management.,2024,Genes,,,,"Artificial intelligence (AI) is rapidly transforming the field of medicine, announcing a new era of innovation and efficiency. Among AI programs designed for general use, ChatGPT holds a prominent position, using an innovative language model developed by OpenAI. Thanks to the use of deep learning techniques, ChatGPT stands out as an exceptionally viable tool, renowned for generating human-like responses to queries. Various medical specialties, including rheumatology, oncology, psychiatry, internal medicine, and ophthalmology, have been explored for ChatGPT integration, with pilot studies and trials revealing each field's potential benefits and challenges. However, the field of genetics and genetic counseling, as well as that of rare disorders, represents an area suitable for exploration, with its complex datasets and the need for personalized patient care. In this review, we synthesize the wide range of potential applications for ChatGPT in the medical field, highlighting its benefits and limitations. We pay special attention to rare and genetic disorders, aiming to shed light on the future roles of AI-driven chatbots in healthcare. 
Our goal is to pave the way for a healthcare system that is more knowledgeable, efficient, and centered around patient needs.",Zampatti S; Peconi C; Megalizzi D; Calvino G; Trastulli G; Cascella R; Strafella C; Caltagirone C; Giardina E 38516933,AI vs academia: Experimental study on AI text detectors' accuracy in behavioral health academic writing.,2024,Accountability in research,,,,"Artificial Intelligence (AI) language models continue to expand in both access and capability. As these models have evolved, the number of academic journals in medicine and healthcare which have explored policies regarding AI-generated text has increased. The implementation of such policies requires accurate AI detection tools. Inaccurate detectors risk unnecessary penalties for human authors and/or may compromise the effective enforcement of guidelines against AI-generated content. Yet, the accuracy of AI text detection tools in identifying human-written versus AI-generated content has been found to vary across published studies. This experimental study used a sample of behavioral health publications and found problematic false positive and false negative rates from both free and paid AI detection tools. The study assessed 100 research articles from 2016-2018 in behavioral health and psychiatry journals and 200 texts produced by AI chatbots (100 by ""ChatGPT"" and 100 by ""Claude""). The free AI detector showed a median of 27.2% for the proportion of academic text identified as AI-generated, while commercial software Originality.AI demonstrated better performance but still had limitations, especially in detecting texts generated by Claude. 
These error rates raise doubts about relying on AI detectors to enforce strict policies around AI text generation in behavioral health publications.",Popkov AA; Barrett TS 38458774,A Conversation with ChatGPT on Contentious Issues in Senescence and Cancer Research.,2024,Molecular pharmacology,,,,"Artificial intelligence (AI) platforms, such as Generative Pretrained Transformer (ChatGPT), have achieved a high degree of popularity within the scientific community due to their utility in providing evidence-based reviews of the literature. However, the accuracy and reliability of the information output and the ability to provide critical analysis of the literature, especially with respect to highly controversial issues, has generally not been evaluated. In this work, we arranged a question/answer session with ChatGPT regarding several unresolved questions in the field of cancer research relating to therapy-induced senescence (TIS), including the topics of senescence reversibility, its connection to tumor dormancy, and the pharmacology of the newly emerging drug class of senolytics. ChatGPT generally provided responses consistent with the available literature, although occasionally overlooking essential components of the current understanding of the role of TIS in cancer biology and treatment. Although ChatGPT, and similar AI platforms, have utility in providing an accurate evidence-based review of the literature, their outputs should still be considered carefully, especially with respect to unresolved issues in tumor biology. SIGNIFICANCE STATEMENT: Artificial Intelligence platforms have provided great utility for researchers to investigate biomedical literature in a prompt manner. However, several issues arise when it comes to certain unresolved biological questions, especially in the cancer field. 
This work provided a discussion with ChatGPT regarding some of the yet-to-be-fully-elucidated conundrums of the role of therapy-induced senescence in cancer treatment and highlights the strengths and weaknesses in utilizing such platforms for analyzing the scientific literature on this topic.",Elshazly AM; Shahin U; Al Shboul S; Gewirtz DA; Saleh T 37528607,"ChatGPT: Forensic, legal, and ethical issues.",2024,"Medicine, science, and the law",,,,"Artificial intelligence (AI) refers to a group of technologies that enable people to perform a variety of activities, including observing, comprehending, analysing and translating data, among other things. Nowadays, practically every school of thought is interested in AI. One such innovation, a chatbot by the name of ChatGPT (Chat Generative Pre-Trained Transformer), launched by OpenAI recently, has taken the internet by storm. It had one million users within 1 week of its launch. The present communication explores the practicability and versatility of the ChatGPT in forensic examinations and scenarios, and also addresses the ethical and legal issues surrounding its usage. The observations suggest that the said technology, in its current form, has limited relevance in the realm of forensic science and the law. Only human critical thinking, expertise, and practical experience can provide the information and competencies needed in the realms of forensics, research, clinical and legal practices. 
Thus, ChatGPT should be used with the utmost caution in the disciplines of medicine, forensic science, and the law, irrespective of its many positive attributes.",Guleria A; Krishan K; Sharma V; Kanchan T 37771867,"ChatGPT in action: Harnessing artificial intelligence potential and addressing ethical challenges in medicine, education, and scientific research.",2023,World journal of methodology,,,,"Artificial intelligence (AI) tools, like OpenAI's Chat Generative Pre-trained Transformer (ChatGPT), hold considerable potential in healthcare, academia, and diverse industries. Evidence demonstrates capability at a medical student level in standardized tests, suggesting utility in medical education, radiology reporting, genetics research, data optimization, and drafting repetitive texts such as discharge summaries. Nevertheless, these tools should augment, not supplant, human expertise. Despite promising applications, ChatGPT confronts limitations, including critical thinking tasks and generating false references, necessitating stringent cross-verification. Ensuing concerns, such as potential misuse, bias, blind trust, and privacy, underscore the need for transparency, accountability, and clear policies. Evaluations of AI-generated content and preservation of academic integrity are critical. With responsible use, AI can significantly improve healthcare, academia, and industry without compromising integrity and research quality. For effective and ethical AI deployment, collaboration amongst AI developers, researchers, educators, and policymakers is vital.
The development of domain-specific tools, guidelines, regulations, and the facilitation of public dialogue must underpin these endeavors to responsibly harness AI's potential.",Jeyaraman M; Ramasubramanian S; Balaji S; Jeyaraman N; Nallakumarasamy A; Sharma S 37062612,[Artificial intelligence and internal medicine: The example of hydroxychloroquine according to ChatGPT].,2023,La Revue de medecine interne,,,,"Artificial intelligence (AI) using deep learning is revolutionizing several fields, including medicine, with a wide range of applications. Available since the end of 2022, ChatGPT is a conversational AI or ""chatbot"" that uses artificial intelligence to dialogue with its users on any topic. Through the example of hydroxychloroquine (HCQ), we discuss its use for patients, clinicians, or researchers, and discuss its performance and limitations, particularly in relation to algorithmic bias. While AI tools using deep learning do not replace the expertise and experience of a clinician (at least for the moment), they have the potential to improve or simplify our daily practice.",Nguyen Y; Costedoat-Chalumeau N 40184826,Artificial Intelligence in Health Care: A Rallying Cry for Critical Clinical Research and Ethical Thinking.,2025,Clinical oncology (Royal College of Radiologists (Great Britain)),,,,"Artificial intelligence (AI) will impact a large proportion of jobs in the short to medium term, especially in the developed countries. The consequences will be felt across many sectors including health care, a critical sector for implementation of AI tools because glitches in algorithms or biases in training datasets may lead to suboptimal treatment that may negatively affect the health of an individual. The stakes are obviously higher in case of potentially life-threatening diseases such as cancer and therapies with a potential for causing severe or even fatal adverse events.
Over the last two decades, much of the research on AI in health care has focussed on diagnostic radiology and digital pathology, but a solid body of research is emerging on AI tools in the radiation oncology workflow. Many of these applications are relatively uncontroversial, although there is still a lack of evidence regarding effectiveness rather than efficiency, and, the ultimate bar, evidence of clinical utility. Proponents of AI will argue that these algorithms should be implemented with robust human supervision. One challenge here is the deskilling effect associated with new technologies. We will become increasingly dependent on the AI tools over time, and we will become less capable of assessing the quality of the AI output. Much of this research appears almost old-fashioned in view of the rapid advances in generative artificial intelligence (GenAI). GenAI can draw from multiple types of data and produce output that is personalised and appears relevant in the given context. In particular, the rapid progress in large language models (LLMs) has opened a wide field of potential applications that were out of bounds just a few years ago. One LLM, Generative Pre-trained Transformer 4 (GPT-4), has been made widely accessible to end-users as ChatGPT-4, which passed a rigorous Turing test in a recent study. In this viewpoint, I argue for the necessity of independent academic research to establish evidence-based applications of AI in medicine. Algorithmic medicine is an intervention similar to a new drug or a new medical device. We should be especially concerned about under-represented minorities and rare/atypical clinical cases that may drown in the petabyte-sized training sets. A huge educational push is needed to ensure that the end-users of AI in health care understand the strengths and weaknesses of algorithmic medicine.
Finally, we need to address the ethical boundaries for where and when GenAI can replace humans in the relationship between patients and healthcare providers.",Bentzen SM 38523987,Ethical Concerns About ChatGPT in Healthcare: A Useful Tool or the Tombstone of Original and Reflective Thinking?,2024,Cureus,,,,"Artificial intelligence (AI), the rising technology of computer science aiming to create digital systems with human behavior and intelligence, seems to have invaded almost every field of modern life. Launched in November 2022, ChatGPT (Chat Generative Pre-trained Transformer) is a textual AI application capable of creating human-like responses characterized by original language and high coherence. Although AI-based language models have demonstrated impressive capabilities in healthcare, ChatGPT has received controversial annotations from the scientific and academic communities. This chatbot already appears to have a massive impact as an educational tool for healthcare professionals and transformative potential for clinical practice and could lead to dramatic changes in scientific research. Nevertheless, rational concerns were raised regarding whether the pre-trained, AI-generated text would be a menace not only for original thinking and new scientific ideas but also for academic and research integrity, as it gets more and more difficult to distinguish its AI origin due to the coherence and fluency of the produced text. This short review aims to summarize the potential applications and the consequential implications of ChatGPT in the three critical pillars of medicine: education, research, and clinical practice. In addition, this paper discusses whether the current use of this chatbot is in compliance with the ethical principles for the safe use of AI in healthcare, as determined by the World Health Organization.
Finally, this review highlights the need for an updated ethical framework and the increased vigilance of healthcare stakeholders to harvest the potential benefits and limit the imminent dangers of this new innovative technology.",Kapsali MZ; Livanis E; Tsalikidis C; Oikonomou P; Voultsos P; Tsaroucha A 37833847,AI language models in human reproduction research: exploring ChatGPT's potential to assist academic writing.,2023,"Human reproduction (Oxford, England)",,,,"Artificial intelligence (AI)-driven language models have the potential to serve as an educational tool, facilitate clinical decision-making, and support research and academic writing. The benefits of their use are yet to be evaluated and concerns have been raised regarding the accuracy, transparency, and ethical implications of using this AI technology in academic publishing. At the moment, Chat Generative Pre-trained Transformer (ChatGPT) is one of the most powerful and widely debated AI language models. Here, we discuss its feasibility to answer scientific questions, identify relevant literature, and assist writing in the field of human reproduction. With consideration of the scarcity of data on this topic, we assessed the feasibility of ChatGPT in academic writing, using data from six meta-analyses published in a leading journal of human reproduction. The text generated by ChatGPT was evaluated and compared to the original text by blinded reviewers. While ChatGPT can produce high-quality text and summarize information efficiently, its current ability to interpret data and answer scientific questions is limited, and it cannot be relied upon for a literature search or accurate source citation due to the potential spread of incomplete or false information. We advocate for open discussions within the reproductive medicine research community to explore the advantages and disadvantages of implementing this AI technology. 
Researchers and reviewers should be informed about AI language models, and we encourage authors to transparently disclose their use.",Semrl N; Feigl S; Taumberger N; Bracic T; Fluhr H; Blockeel C; Kollmann M 38292297,Establishing priorities for implementation of large language models in pathology and laboratory medicine.,2024,Academic pathology,,,,"Artificial intelligence and machine learning have numerous applications in pathology and laboratory medicine. The release of ChatGPT prompted speculation regarding the potentially transformative role of large-language models (LLMs) in academic pathology, laboratory medicine, and pathology education. Because of the potential to improve LLMs over the upcoming years, pathology and laboratory medicine clinicians are encouraged to embrace this technology, identify pathways by which LLMs may support our missions in education, clinical practice, and research, participate in the refinement of AI modalities, and design user-friendly interfaces that integrate these tools into our most important workflows. Challenges regarding the use of LLMs, which have already received considerable attention in a general sense, are also reviewed herein within the context of the pathology field and are important to consider as LLM applications are identified and operationalized.",Arvisais-Anhalt S; Gonias SL; Murray SG 38170274,Progression of an Artificial Intelligence Chatbot (ChatGPT) for Pediatric Cardiology Educational Knowledge Assessment.,2024,Pediatric cardiology,,,,"Artificial intelligence chatbots, like ChatGPT, have become powerful tools that are disrupting how humans interact with technology. The potential uses within medicine are vast. In medical education, these chatbots have shown improvements, in a short time span, in generalized medical examinations. We evaluated the overall performance and improvement between ChatGPT 3.5 and 4.0 in a test of pediatric cardiology knowledge. 
ChatGPT 3.5 and ChatGPT 4.0 were used to answer text-based multiple-choice questions derived from a Pediatric Cardiology Board Review textbook. Each chatbot was given an 88-question test, subcategorized into 11 topics. We excluded questions with modalities other than text (sound clips or images). Statistical analysis was done using an unpaired two-tailed t-test. Of the same 88 questions, ChatGPT 4.0 answered 66% of the questions correctly (n = 58/88), which was significantly greater (p < 0.0001) than ChatGPT 3.5, which only answered 38% (33/88). The ChatGPT 4.0 version also did better on each subspecialty topic as compared to ChatGPT 3.5. While acknowledging that ChatGPT does not yet offer subspecialty level knowledge in pediatric cardiology, the performance in pediatric cardiology educational assessments showed a considerable improvement in a short period of time between ChatGPT 3.5 and 4.0.",Gritti MN; AlTurki H; Farid P; Morgan CT 39569947,[ARTIFICIAL INTELLIGENCE AND MEDICAL ETHICS].,2024,Harefuah,,,,"Artificial intelligence has burst into our lives with great vigor in recent years. We encounter it in all areas of life, as well as in the field of medicine. The article refers to medical ethics in two areas: One field is medicine based on Big Data and the other is the chatbot or ChatGPT. These two fields basically operate in three stages: collecting data, building an algorithm, and drawing conclusions and a course of action. During the data collection phase, as doctors we must not forget to preserve the autonomy and medical confidentiality of the patient. Despite all the technology and innovations, in the end, the doctor makes decisions with the cooperation of the patient, and the discretion on whether to use the diagnosis, treatment and knowledge remains in the hands of the doctor.
In the realm of research in reviewing materials and writing articles, when artificial intelligence is used, caution and criticality should be exercised, since the results obtained when using artificial intelligence can be doubly misleading.",Karni T 37812965,Improving radiology workflow using ChatGPT and artificial intelligence.,2023,Clinical imaging,,,,"Artificial Intelligence is a branch of computer science that aims to create intelligent machines capable of performing tasks that typically require human intelligence. One of the branches of artificial intelligence is natural language processing, which is dedicated to studying the interaction between computers and human language. ChatGPT is a sophisticated natural language processing tool that can understand and respond to complex questions and commands in natural language. Radiology is a vital aspect of modern medicine that involves the use of imaging technologies to diagnose and treat medical conditions. Artificial intelligence, including ChatGPT, can be integrated into radiology workflows to improve efficiency, accuracy, and patient care. ChatGPT can streamline various radiology workflow steps, including patient registration, scheduling, patient check-in, image acquisition, interpretation, and reporting. While ChatGPT has the potential to transform radiology workflows, there are limitations to the technology that must be addressed, such as the potential for bias in artificial intelligence algorithms and ethical concerns.
As technology continues to advance, ChatGPT is likely to become an increasingly important tool in the field of radiology, and in healthcare more broadly.",Mese I; Taslicay CA; Sivrioglu AK 37864445,Conversation between a clinical biologist and an artificial intelligence on prostate cancer biomarkers: a critical reading.,2023,Annales de biologie clinique,,,,"Artificial intelligence is increasingly used in the field of medicine as a diagnostic aid, particularly for image analysis and more generally for data processing. Many artificial intelligence-based tools have been specifically developed for clinical biology, but some more general ones can help to improve the dissemination of medical knowledge. To test whether and to what extent an automated conversation tool could answer questions on a clinical biology topic (i.e. prostate cancer biomarkers), we questioned ChatGPT, an artificial intelligence-powered model dedicated to optimizing language models for dialogue. Then we analyzed its responses.",Lamy PJ 37148260,"Harvesting the Power of Artificial Intelligence for Surgery: Uses, Implications, and Ethical Considerations.",2023,The American surgeon,,,,"Artificial intelligence is rapidly advancing, especially with the advent of ChatGPT technology, and its role in the world of medicine is expanding. Within surgery, AI has the capacity to improve efficiency and results in surgical treatments; however, it similarly has the potential to impose harm onto patients and undermine the role of medical providers. Its benefits may include improvements in surgical outcomes, spanning from enhanced pre-operative diagnostic capabilities to more refined intra-operative techniques, and long-term patient experiences, by identifying and reducing complications. Nevertheless, apprehensions revolve around laymen use potentially resulting in inappropriate therapeutic interventions, in addition to safety and ethical risks surrounding the use of patient data.
Various strategies towards mitigating these harms must be considered, such as patient disclaimers and secondary review policies. While artificial intelligence brings exciting advancements to surgery, its integration must be cautiously monitored.",Kavian JA; Wilkey HL; Patel PA; Boyd CJ 38727914,Application of Deep Learning for Studying NMDA Receptors.,2024,"Methods in molecular biology (Clifton, N.J.)",,,,"Artificial intelligence underwent remarkable advancement in the past decade, revolutionizing our way of thinking and unlocking unprecedented opportunities across various fields, including drug development. The emergence of large pretrained models, such as ChatGPT, has even begun to demonstrate human-level performance in certain tasks. However, the difficulty of deploying and utilizing AI and pretrained models for nonexperts has limited their practical use. To overcome this challenge, here we present three highly accessible online tools based on a large pretrained model for chemistry, the Uni-Mol, for drug development against CNS diseases, including those targeting NMDA receptor: the blood-brain barrier (BBB) permeability prediction, the quantitative structure-activity relationship (QSAR) analysis system, and a versatile interface of the AI-based molecule generation model named VD-gen. We believe that these resources will effectively bridge the gap between cutting-edge AI technology and NMDAR experts, facilitating rapid and rational drug development.",Deng Z; Gu R; Wen H 38836893,ChatGPT: A Conceptual Review of Applications and Utility in the Field of Medicine.,2024,Journal of medical systems,,,,"Artificial Intelligence, specifically advanced language models such as ChatGPT, has the potential to revolutionize various aspects of healthcare, medical education, and research. In this narrative review, we evaluate the myriad applications of ChatGPT in diverse healthcare domains.
We discuss its potential role in clinical decision-making, exploring how it can assist physicians by providing rapid, data-driven insights for diagnosis and treatment. We review the benefits of ChatGPT in personalized patient care, particularly in geriatric care, medication management, weight loss and nutrition, and physical activity guidance. We further delve into its potential to enhance medical research, through the analysis of large datasets, and the development of novel methodologies. In the realm of medical education, we investigate the utility of ChatGPT as an information retrieval tool and personalized learning resource for medical students and professionals. There are numerous promising applications of ChatGPT that will likely induce paradigm shifts in healthcare practice, education, and research. The use of ChatGPT may come with several benefits in areas such as clinical decision making, geriatric care, medication management, weight loss and nutrition, physical fitness, scientific research, and medical education. Nevertheless, it is important to note that issues surrounding ethics, data privacy, transparency, inaccuracy, and inadequacy persist. Prior to widespread use in medicine, it is imperative to objectively evaluate the impact of ChatGPT in a real-world setting using a risk-based approach.",Rao SJ; Isath A; Krishnan P; Tangsrivimol JA; Virk HUH; Wang Z; Glicksberg BS; Krittanawong C 37369944,Advancing the Production of Clinical Medical Devices Through ChatGPT.,2024,Annals of biomedical engineering,,,,"As a recently popular large language model, Chatbot Generative Pre-trained Transformer (ChatGPT) is highly valued in the field of clinical medicine. Due to the limited understanding of the potential impact of ChatGPT on the manufacturing side of clinical medical devices, we aim to fill this gap through this article. 
We elucidate the classification of medical devices and explore the positive contributions of ChatGPT in various aspects of medical device design, optimization, and improvement. However, limitations such as the potential for misinterpretation of user intent, lack of personal experience, and the need for human supervision should be taken into consideration. Striking a balance between ChatGPT and human expertise can ensure the safety, quality, and compliance of medical devices. This work contributes to the advancement of ChatGPT in the medical device manufacturing industry and highlights the synergistic relationship between artificial intelligence and human involvement in healthcare.",Li S; Guo Z; Zang X 37425598,Radiology Gets Chatty: The ChatGPT Saga Unfolds.,2023,Cureus,,,,"As artificial intelligence (AI) continues to evolve and mature, it is increasingly finding applications in the field of healthcare, particularly in specialties like radiology that are data-heavy and image-focused. Language learning models (LLMs) such as OpenAI's Generative Pre-trained Transformer-4 (GPT-4) are new in the field of medicine, and there is a paucity of literature regarding the possible utilities of GPT-4 given its novelty. We aim to present an in-depth exploration of the role of GPT-4, an advanced language model, in radiology. When given prompts for generating reports, template generation, enhancing clinical decision-making, suggesting captivating titles for research articles, patient communication, and education, GPT-4's output can occasionally be quite generic, and at times, it may present factually incorrect content, which could lead to errors. The responses were then analyzed in detail regarding their potential utility in day-to-day radiologist workflow, patient education, and research processes.
Further research is required to evaluate LLMs' accuracy and safety in clinical practice and to develop comprehensive guidelines for their implementation.",Grewal H; Dhillon G; Monga V; Sharma P; Buddhavarapu VS; Sidhu G; Kashyap R 37790062,A Radiation Oncology Board Exam of ChatGPT.,2023,Cureus,,,,"As artificial intelligence (AI) models improve and become widely integrated into healthcare systems, healthcare providers must understand the strengths and limitations of AI tools to realize the full spectrum of potential patient-care benefits. However, most providers have a poor understanding of AI, leading to distrust and poor adoption of this emerging technology. To bridge this divide, this editorial presents a novel view of ChatGPT's current capabilities in the medical field of radiation oncology. By replicating the format of the oral qualification exam required for radiation oncology board certification, we demonstrate ChatGPT's ability to analyze a commonly encountered patient case, make diagnostic decisions, and integrate information to generate treatment recommendations. Through this simulation, we highlight ChatGPT's strengths and limitations in replicating human decision-making in clinical radiation oncology, while providing an accessible resource to educate radiation oncologists on the capabilities of AI chatbots.",Barbour AB; Barbour TA 37891532,ChatGPT and mycosis- a new weapon in the knowledge battlefield.,2023,BMC infectious diseases,,,,"As a current trend in physician tools, ChatGPT can sift through massive amounts of information and solve problems through easy-to-understand conversations, ultimately improving efficiency. Mycosis is currently facing great challenges, including high fungal burdens, high mortality, limited choice of antifungal drugs and increasing drug resistance. To address these challenges, we posed fungal infection scenario-based questions to ChatGPT and assessed its appropriateness, consistency, and potential pitfalls.
We concluded that ChatGPT can provide compelling responses to most prompts, including diagnosis, recommendations for examination, treatment and rational drug use. Moreover, we summarized exciting future applications in mycosis, such as clinical work, scientific research, education and healthcare. However, the largest barriers to implementation are deficits in individual advice, timely literature updates, consistency, accuracy and data safety. To fully embrace the opportunity, we need to address these barriers and manage the risks. We expect that ChatGPT will become a new weapon in the battlefield of mycosis.",Jin Y; Liu H; Zhao B; Pan W 37988149,"The Intersection of ChatGPT, Clinical Medicine, and Medical Education.",2023,JMIR medical education,,,,"As we progress deeper into the digital age, the robust development and application of advanced artificial intelligence (AI) technology, specifically generative language models like ChatGPT (OpenAI), have potential implications in all sectors including medicine. This viewpoint article aims to present the authors' perspective on the integration of AI models such as ChatGPT in clinical medicine and medical education. The unprecedented capacity of ChatGPT to generate human-like responses, refined through Reinforcement Learning with Human Feedback, could significantly reshape the pedagogical methodologies within medical education. Through a comprehensive review and the authors' personal experiences, this viewpoint article elucidates the pros, cons, and ethical considerations of using ChatGPT within clinical medicine and notably, its implications for medical education. This exploration is crucial in a transformative era where AI could potentially augment human capability in the process of knowledge creation and dissemination, potentially revolutionizing medical education and clinical practice. The importance of maintaining academic integrity and professional standards is highlighted.
The relevance of establishing clear guidelines for the responsible and ethical use of AI technologies in clinical medicine and medical education is also emphasized.",Wong RS; Ming LC; Raja Ali RA 37273063,ChatGPTs' Journey in Medical Revolution: A Potential Panacea or a Hidden Pathogen?,2023,Annals of biomedical engineering,,,,"At the fascinating intersection of artificial intelligence and medicine, ChatGPT morphs into a compact, personal digital physician. With a simple click, it furnishes an abundance of health-related information, initial medical consultations, and a plethora of disease management recommendations. Moreover, it stands at the ready to provide immediate mental health assistance in times of psychological distress. Yet, each innovation carries inherent challenges. As we embrace the conveniences proffered by ChatGPT, it is imperative that we grapple with associated issues such as data privacy, risk of misdiagnosis, complexities in human-machine interaction, and particular situations that elude its understanding. Let's probe further into this intriguing world, brimming with contention and prospects, and observe how ChatGPT traverses the landscape of digital health, uncovering the potential it holds for the future evolution of medical practice.",Yang J 39735146,Categorization of Novel Research Ideas Regarding Adolescent Idiopathic Scoliosis Generated by Artificial Intelligence.,2024,Cureus,,,,"Background The generation of innovative research ideas is crucial to advancing the field of medicine. As physicians face increasingly demanding clinical schedules, it is important to identify tools that may expedite the research process. Artificial intelligence may offer a promising solution by enabling the efficient generation of novel research ideas. This study aimed to assess the feasibility of using artificial intelligence to build upon existing knowledge by generating innovative research questions. 
Methods A comparative evaluation study was conducted to assess the ability of AI models to generate novel research questions. The prompt ""research ideas for adolescent idiopathic scoliosis"" was input into ChatGPT 3.5, Gemini 1.5, Copilot, and Llama 3. This resulted in an output of several research questions, ranging from 10 to 14 questions. A keyword-friendly modified version of the AI-generated responses was searched in the PubMed database. Results were limited to manuscripts published in the English language from the year 2000 to the present. Each response was then cross-referenced to the PubMed search results and assigned an originality score of 0-5, with 0 being the most original and 5 being not original at all, by adding one numerical value for each paper already published on the topic. The mean originality scores were calculated manually by summing the originality scores from all the responses from each AI model and then dividing that sum by the respective number of prompts generated by the AI. The standard deviation of the originality scores for each AI was calculated using the standard deviation (STDEV) function in Google Sheets (Google, Mountain View, California). Each AI was also evaluated on its percent novelty, the percentage of total generated responses that yielded an originality score of 0 when searched in PubMed. Results Each AI produced varying numbers of research prompts that were inputted into PubMed. The mean originality scores for ChatGPT, Gemini, Copilot, and Llama were 4.2 +/- 1.9, 4.1 +/- 1.3, 4.0 +/- 1.6, and 3.8 +/- 1.7, respectively. Of ChatGPT's 12 prompts, 16.67% were completely novel (no prior research had been conducted on the topic provided by the AI model). 10.00% of Copilot's 10 prompts were completely novel, and 8.33% of Llama's 12 prompts were completely novel. None of Gemini's 14 responses yielded an originality score of 0.
Conclusions Our findings demonstrate that ChatGPT, Llama, and Copilot are capable of generating novel ideas in orthopaedics research. As these models continue to evolve and become even more refined with time, physicians and scientists should consider incorporating them when brainstorming and planning their research studies.",Leonardo CJ; Melcer K; Liu SH; Komatsu DE; Barsi JM 39081915,Comparing patient education tools for chronic pain medications: Artificial intelligence chatbot versus traditional patient information leaflets.,2024,Indian journal of anaesthesia,,,,"BACKGROUND AND AIMS: Artificial intelligence (AI) chatbots like Conversational Generative Pre-trained Transformer (ChatGPT) have recently created much buzz, especially regarding patient education. Such informed patients understand and adhere to the management and get involved in shared decision making. The accuracy and understandability of the generated educational material are prime concerns. Thus, we compared ChatGPT with traditional patient information leaflets (PILs) about chronic pain medications. METHODS: Patients' frequently asked questions were generated from PILs available on the official websites of the British Pain Society (BPS) and the Faculty of Pain Medicine. Eight blinded annexures were prepared for evaluation, consisting of traditional PILs from the BPS and AI-generated patient information materials structured similar to PILs by ChatGPT. The authors performed a comparative analysis to assess materials' readability, emotional tone, accuracy, actionability, and understandability. Readability was measured using Flesch Reading Ease (FRE), Gunning Fog Index (GFI), and Flesch-Kincaid Grade Level (FKGL). Sentiment analysis determined emotional tone. An expert panel evaluated accuracy and completeness. Actionability and understandability were assessed with the Patient Education Materials Assessment Tool. 
RESULTS: Traditional PILs generally exhibited higher readability (P values < 0.05), with [mean (standard deviation)] FRE [62.25 (1.6) versus 48 (3.7)], GFI [11.85 (0.9) versus 13.65 (0.7)], and FKGL [8.33 (0.5) versus 10.23 (0.5)] but varied emotional tones, often negative, compared to more positive sentiments in ChatGPT-generated texts. Accuracy and completeness did not significantly differ between the two. Actionability and understandability scores were comparable. CONCLUSION: While AI chatbots offer efficient information delivery, ensuring accuracy and readability, patient-centeredness remains crucial. It is imperative to balance innovation with evidence-based practice.",Gondode P; Duggal S; Garg N; Sethupathy S; Asai O; Lohakare P 37385548,Harnessing language models for streamlined postcolonoscopy patient management: a novel approach.,2023,Gastrointestinal endoscopy,,,,"BACKGROUND AND AIMS: ChatGPT, an advanced language model, is increasingly used in diverse fields, including medicine. This study explores using ChatGPT to optimize postcolonoscopy management by providing guideline-based recommendations and addressing low compliance rates and timing issues. METHODS: In this proof-of-concept study, 20 clinical scenarios were prepared as structured reports and free-text notes, and ChatGPT's responses were evaluated by 2 senior gastroenterologists. Compliance with guidelines and accuracy were assessed, and inter-rater agreement was calculated using Fleiss' kappa coefficient. RESULTS: ChatGPT exhibited 90% compliance with guidelines and 85% accuracy, with a very good inter-rater agreement (Fleiss' kappa coefficient of .84, P < .01). ChatGPT handled multiple variations and descriptions and crafted concise patient letters. CONCLUSIONS: Results suggest that ChatGPT could aid healthcare providers in making informed decisions and improve compliance with postcolonoscopy surveillance guidelines. 
Future research should investigate integrating ChatGPT into electronic health record systems and evaluating its effectiveness in different healthcare settings and populations.",Gorelik Y; Ghersin I; Maza I; Klein A 39866721,Can Artificial Intelligence Create an Accurate Colonoscopy Bowel Preparation Prompt?,2025,Gastro hep advances,,,,"BACKGROUND AND AIMS: Colorectal cancer is the third most common cancer in the United States, with colonoscopy being the preferred screening method. Up to 25% of colonoscopies are associated with poor preparation, which leads to prolonged procedure time, repeat colonoscopies, and decreased adenoma detection. Artificial intelligence (AI) is being increasingly used in medicine, assessing medical school exam questions, and writing medical reports. In gastroenterology, it has been used to educate patients with cirrhosis and hepatocellular carcinoma, answer patient questions about colonoscopy and provide correct colonoscopy screening intervals, having the ability to augment the patient-provider relationship. This study aims at assessing the accuracy of a ChatGPT-generated precolonoscopy bowel preparation prompt. METHODS: A nonrandomized cross-sectional study assessing the perceptions of an AI-generated colonoscopy preparation prompt was conducted in a large multisite quaternary health-care institution in the northeast United States. All practicing gastroenterologists in the health system were surveyed, of whom 208 had a valid email address and were included in the study. A Research Electronic Data Capture survey was then distributed to all participants and analyzed using descriptive statistics. RESULTS: Overall, 91% of gastroenterologist physicians found the prompt easy to understand, 95% thought the prompt was scientifically accurate, and 66% were comfortable giving the prompt to their patients. Sixty-four percent of reviewers correctly identified the ChatGPT-generated prompt, but only 32% were confident in their answer.
CONCLUSION: The ability of ChatGPT to create a sufficient bowel preparation prompt highlights how physicians can incorporate AI into clinical practice to improve ease and efficiency of communication with patients when it comes to bowel preparation.",Wilkoff MH; Piniella NR; Advani R 40248774,Artificial intelligence enhanced Chatbot boom: A single center observational study to evaluate assistance in clinical anesthesiology.,2025,"Journal of anaesthesiology, clinical pharmacology",,,,"BACKGROUND AND AIMS: The field of anaesthesiology and perioperative medicine has explored advancements in science and technology, ensuring precision and personalized anesthesia plans. The surge in the usage of chat-generative pretrained transformer (Chat GPT) in medicine has evoked interest among anesthesiologists to assess its performance in the operating room. However, there is concern about accuracy, patient privacy and ethics. Our objective in this study is to assess whether Chat GPT can provide assistance in clinical decisions and to compare its responses with those of resident anesthesiologists. MATERIAL AND METHODS: In this cross-sectional study conducted at a teaching hospital, a set of 30 hypothetical clinical scenarios in the operating room was presented to resident anesthesiologists and Chat-GPT 4. The first five scenarios out of 30 were typed with three additional prompts in the same chat to determine if there was any detailing of answers. The responses were labeled and assessed by three reviewers not involved in the study. RESULTS: The intraclass correlation coefficient (ICC) values show variation in the level of agreement between Chat GPT and anesthesiologists. For instance, the ICC of 0.41 between A1 and Chat GPT indicates a moderate level of agreement, whereas the ICC of 0.06 between A2 and Chat GPT suggests a comparatively weaker level of agreement.
CONCLUSIONS: In this study, it was found that there were variations in the level of agreement between Chat GPT's and resident anesthesiologists' responses in terms of accuracy and comprehensiveness in solving intraoperative scenarios. The use of prompts improved the agreement of Chat GPT with anesthesiologists.",Jois SM; Rangalakshmi S; Iyengar SMJ; Mahesh C; Devi LD; Namachivayam AK 38800159,The Potential of ChatGPT for High-Quality Information in Patient Education for Sports Surgery.,2024,Cureus,,,,"BACKGROUND AND OBJECTIVE: Artificial intelligence (AI) advancements continue to have a profound impact on modern society, driving significant innovation and development across various fields. We sought to appraise the reliability of the information offered by Chat Generative Pre-Trained Transformer (ChatGPT) regarding diseases commonly associated with sports surgery. We hypothesized that ChatGPT could offer high-quality information on sports-related diseases and be used in patient education. METHODS: On September 11, 2023, specific sports surgery-related diseases were identified to ask ChatGPT-4 (personal communication, March 4, 2023). The informative texts provided by ChatGPT were recorded by a non-observer senior orthopedic surgeon for this study. Ten texts provided by ChatGPT related to sports surgery diseases were evaluated blindly by two observers. Observers assessed and scored these texts based on the sports surgery-specific scoring (SSSS) and DISCERN criteria. The precision of the disease-related information offered by ChatGPT was evaluated. RESULTS: The calculated average DISCERN score of the texts in the study was 44.75 points and the average SSSS score was 13.3 points. In the intraclass correlation coefficient analysis of the measurements made by the observers, the agreement was found to be excellent (0.989; p < 0.001). CONCLUSION: ChatGPT has the potential to be used in patient education for sports surgery-related diseases.
The potential to provide quality information in this regard seems to be an advantage.",Yuce A; Erkurt N; Yerli M; Misir A 38107064,Is ChatGPT accurate and reliable in answering questions regarding head and neck cancer?,2023,Frontiers in oncology,,,,"BACKGROUND AND OBJECTIVE: Chat Generative Pre-trained Transformer (ChatGPT) is an artificial intelligence (AI)-based language processing model using deep learning to create human-like text dialogue. It has been a popular source of information covering vast number of topics including medicine. Patient education in head and neck cancer (HNC) is crucial to enhance the understanding of patients about their medical condition, diagnosis, and treatment options. Therefore, this study aims to examine the accuracy and reliability of ChatGPT in answering questions regarding HNC. METHODS: 154 head and neck cancer-related questions were compiled from sources including professional societies, institutions, patient support groups, and social media. These questions were categorized into topics like basic knowledge, diagnosis, treatment, recovery, operative risks, complications, follow-up, and cancer prevention. ChatGPT was queried with each question, and two experienced head and neck surgeons assessed each response independently for accuracy and reproducibility. Responses were rated on a scale: (1) comprehensive/correct, (2) incomplete/partially correct, (3) a mix of accurate and inaccurate/misleading, and (4) completely inaccurate/irrelevant. Discrepancies in grading were resolved by a third reviewer. Reproducibility was evaluated by repeating questions and analyzing grading consistency. RESULTS: ChatGPT yielded ""comprehensive/correct"" responses to 133/154 (86.4%) of the questions whereas, rates of ""incomplete/partially correct"" and ""mixed with accurate and inaccurate data/misleading"" responses were 11% and 2.6%, respectively. There were no ""completely inaccurate/irrelevant"" responses. 
According to category, the model provided ""comprehensive/correct"" answers to 80.6% of questions regarding ""basic knowledge"", 92.6% related to ""diagnosis"", 88.9% related to ""treatment"", 80% related to ""recovery - operative risks - complications - follow-up"", 100% related to ""cancer prevention"" and 92.9% related to ""other"". There was no significant difference between the categories regarding the grades of ChatGPT responses (p=0.88). The rate of reproducibility was 94.1% (145 of 154 questions). CONCLUSION: ChatGPT generated substantially accurate and reproducible information in response to diverse medical queries related to HNC. Despite its limitations, it can be a useful source of information for both patients and medical professionals. With further developments in the model, ChatGPT can also play a crucial role in clinical decision support to provide the clinicians with up-to-date information.",Kuscu O; Pamuk AE; Sutay Suslu N; Hosal S 39828224,Using artificial intelligence to semi-automate trustworthiness assessment of randomized controlled trials: a case study.,2025,Journal of clinical epidemiology,,,,"BACKGROUND AND OBJECTIVE: Randomized controlled trials (RCTs) are the cornerstone of evidence-based medicine. Unfortunately, not all RCTs are based on real data. This serious breach of research integrity compromises the reliability of systematic reviews and meta-analyses, leading to misinformed clinical guidelines and posing a risk to both individual and public health. While methods to detect problematic RCTs have been proposed, they are time-consuming and labor-intensive. The use of artificial intelligence large language models (LLMs) has the potential to accelerate the data collection needed to assess the trustworthiness of published RCTs. METHODS: We present a case study using ChatGPT powered by OpenAI's GPT-4o to assess an RCT paper.
The case study focuses on applying the trustworthiness in randomised controlled trials (TRACT) checklist and automating data table extraction to accelerate statistical analysis targeting the trustworthiness of the data. We provide a detailed step-by-step outline of the process, along with considerations for potential improvements. RESULTS: ChatGPT completed all tasks by processing the PDF of the selected publication and responding to specific prompts. ChatGPT addressed items in the TRACT checklist effectively, demonstrating an ability to provide precise ""yes"" or ""no"" answers while quickly synthesizing information from both the paper and relevant online resources. A comparison of results generated by ChatGPT and the human assessor showed an 84% (16/19) level of agreement on TRACT items. This substantially accelerated the qualitative assessment process. Additionally, ChatGPT was able to efficiently extract the data tables as Microsoft Excel worksheets and reorganize the data, with three out of four extracted tables achieving an accuracy score of 100%, facilitating subsequent analysis and data verification. CONCLUSION: ChatGPT demonstrates potential in semiautomating the trustworthiness assessment of RCTs, though in our experience this required repeated prompting from the user. Further testing and refinement will involve applying ChatGPT to collections of RCT papers to improve the accuracy of data capture and lessen the role of the user. The ultimate aim is a completely automated process for large volumes of papers, which seems plausible given our initial experience.",Au LS; Qu L; Nielsen J; Ge Z; Gurrin LC; Mol BW; Wang R 40245607,GDReCo: Fine-grained gene-disease relationship extraction corpus.,2025,Computer methods and programs in biomedicine,,,,"BACKGROUND AND OBJECTIVE: Understanding gene-disease relationships is crucial for medical research, drug discovery, clinical diagnosis, and other fields.
However, there is currently no high-quality, fine-grained corpus available for training Natural Language Processing (NLP) models, which have proven to be effective in knowledge extraction. METHODS: This study introduces a novel ontology framework for gene-disease associations, addressing the absence of a formal descriptive system and training corpus for NLP models. RESULTS: We developed the Gene Disease Relationship Extraction Corpus (GDReCo), a refined dataset of over 24,000 cases, including 2300+ manually annotated and 22,000+ model-predicted instances. BERT-based models trained on this data achieved high F1-scores for ""event"" and ""rel"" relationships, validating its effectiveness for Gene-Disease Relationship Extraction (GDRE) tasks. CONCLUSIONS: GDReCo serves as a valuable resource for biomedical research, though ChatGPT's limitations in fine-grained relation extraction are noted.",Yu H; Wu J; Bian S; Zhang S; Wu Y; Zhou Z; Jia Q; Ni Y; Huang Z; Yan H; Wang W; He K; Shi J 39207788,Performance of Language Models on the Family Medicine In-Training Exam.,2024,Family medicine,,,,"BACKGROUND AND OBJECTIVES: Artificial intelligence (AI), such as ChatGPT and Bard, has gained popularity as a tool in medical education. The use of AI in family medicine has not yet been assessed. The objective of this study is to compare the performance of three large language models (LLMs; ChatGPT 3.5, ChatGPT 4.0, and Google Bard) on the family medicine in-training exam (ITE). METHODS: The 193 multiple-choice questions of the 2022 ITE, written by the American Board of Family Medicine, were inputted in ChatGPT 3.5, ChatGPT 4.0, and Bard. The LLMs' performance was then scored and scaled. RESULTS: ChatGPT 4.0 scored 167/193 (86.5%) with a scaled score of 730 out of 800. According to the Bayesian score predictor, ChatGPT 4.0 has a 100% chance of passing the family medicine board exam.
ChatGPT 3.5 scored 66.3%, translating to a scaled score of 400 and an 88% chance of passing the family medicine board exam. Bard scored 64.2%, with a scaled score of 380 and an 85% chance of passing the boards. Compared to the national average of postgraduate year 3 residents, only ChatGPT 4.0 surpassed the residents' mean of 68.4%. CONCLUSIONS: ChatGPT 4.0 was the only LLM that outperformed the family medicine postgraduate year 3 residents' national averages on the 2022 ITE, providing robust explanations and demonstrating its potential use in delivering background information on common medical concepts that appear on board exams.",Hanna RE; Smith LR; Mhaskar R; Hanna K 39336540,Perforator Selection with Computed Tomography Angiography for Unilateral Breast Reconstruction: A Clinical Multicentre Analysis.,2024,"Medicina (Kaunas, Lithuania)",,,,"Background and Objectives: Despite CTAs being critical for preoperative planning in autologous breast reconstruction, experienced plastic surgeons may have differing preferences for which side of the abdomen to use for unilateral breast reconstruction. Large language models (LLMs) have the potential to assist medical imaging interpretation. This study compares the perforator selection preferences of experienced plastic surgeons with four popular LLMs based on CTA images for breast reconstruction. Materials and Methods: Six experienced plastic surgeons from Australia, the US, Italy, Denmark, and Argentina reviewed ten CTA images, indicated their preferred side of the abdomen for unilateral breast reconstruction and recommended the type of autologous reconstruction. The LLMs were prompted to do the same. The average decisions were calculated, recorded in suitable tables, and compared. Results: The six consultants predominantly recommend the DIEP procedure (83%). This suggests experienced surgeons feel more comfortable raising DIEP than TRAM flaps, which they recommended only 3% of the time. 
They also favoured MS TRAM and SIEA less frequently (11% and 2%, respectively). Three LLMs (ChatGPT-4o, ChatGPT-4, and Bing CoPilot) exclusively recommended DIEP (100%), while Claude suggested DIEP 90% and MS TRAM 10%. Despite minor variations in side recommendations, consultants and AI models clearly preferred DIEP. Conclusions: Consultants and LLMs consistently preferred DIEP procedures, indicating strong confidence among experienced surgeons, though LLMs occasionally deviated in recommendations, highlighting limitations in their image interpretation capabilities. This emphasises the need for ongoing refinement of AI-assisted decision support systems to ensure they align more closely with expert clinical judgment and enhance their reliability in clinical practice.",Seth I; Lim B; Phan R; Xie Y; Kenney PS; Bukret WE; Thomsen JB; Cuomo R; Ross RJ; Ng SK; Rozen WM 37888068,"Navigating the Landscape of Personalized Medicine: The Relevance of ChatGPT, BingChat, and Bard AI in Nephrology Literature Searches.",2023,Journal of personalized medicine,,,,"BACKGROUND AND OBJECTIVES: Literature reviews are foundational to understanding medical evidence. With AI tools like ChatGPT, Bing Chat and Bard AI emerging as potential aids in this domain, this study aimed to individually assess their citation accuracy within Nephrology, comparing their performance in providing precise references. MATERIALS AND METHODS: We generated a prompt to solicit 20 references in Vancouver style for each of 12 Nephrology topics, using ChatGPT, Bing Chat and Bard. We verified the existence and accuracy of the provided references using PubMed, Google Scholar, and Web of Science. We categorized the validity of the references from the AI chatbots into (1) incomplete, (2) fabricated, (3) inaccurate, and (4) accurate. RESULTS: A total of 199 (83%), 158 (66%) and 112 (47%) unique references were provided from ChatGPT, Bing Chat and Bard, respectively.
ChatGPT provided 76 (38%) accurate, 82 (41%) inaccurate, 32 (16%) fabricated and 9 (5%) incomplete references. Bing Chat provided 47 (30%) accurate, 77 (49%) inaccurate, 21 (13%) fabricated and 13 (8%) incomplete references. In contrast, Bard provided 3 (3%) accurate, 26 (23%) inaccurate, 71 (63%) fabricated and 12 (11%) incomplete references. The most common error type across platforms was incorrect DOIs. CONCLUSIONS: In the field of medicine, the necessity for faultless adherence to research integrity is highlighted, asserting that even small errors cannot be tolerated. The outcomes of this investigation draw attention to inconsistent citation accuracy across the different AI tools evaluated. Despite some promising results, the discrepancies identified call for a cautious and rigorous vetting of AI-sourced references in medicine. Such chatbots, before becoming standard tools, need substantial refinements to assure unwavering precision in their outputs.",Aiumtrakul N; Thongprayoon C; Suppadungsuk S; Krisanapan P; Miao J; Qureshi F; Cheungpasitporn W 40045700,Artificial intelligence chatbots in transfusion medicine: A cross-sectional study.,2025,Vox sanguinis,,,,"BACKGROUND AND OBJECTIVES: The recent rise of artificial intelligence (AI) chatbots has attracted many users worldwide. However, expert evaluation is essential before relying on them for transfusion medicine (TM)-related information. This study aims to evaluate the performance of AI chatbots for accuracy, correctness, completeness and safety. MATERIALS AND METHODS: Six AI chatbots (ChatGPT 4, ChatGPT 4-o, Gemini Advanced, Copilot, Anthropic Claude 3.5 Sonnet, Meta AI) were tested using TM-related prompts at two time points, 30 days apart. Their responses were assessed by four TM experts. Evaluators' scores underwent inter-rater reliability testing. Responses from Day 30 were compared with those from Day 1 to evaluate consistency and potential evolution over time. 
RESULTS: All six chatbots exhibited some level of inconsistency and varying degrees of evolution in their responses over 30 days. None provided entirely correct, complete or safe answers to all questions. Among the chatbots tested, ChatGPT 4-o and Anthropic Claude 3.5 Sonnet demonstrated the highest accuracy and consistency, while Microsoft Copilot and Google Gemini Advanced showed the greatest evolution in their responses. As a limitation, the 30-day period may be too short for a precise assessment of chatbot evolution. CONCLUSION: At the time of the conduct of this study, none of the AI chatbots provided fully reliable, complete or safe responses to all TM-related prompts. However, ChatGPT 4-o and Anthropic Claude 3.5 Sonnet show the highest promise for future integration into TM practices. Given their variability and ongoing development, AI chatbots should not yet be relied upon as authoritative sources in TM without expert validation.",Srivastava P; Tewari A; Al-Riyami AZ 37519497,"Analysing the Applicability of ChatGPT, Bard, and Bing to Generate Reasoning-Based Multiple-Choice Questions in Medical Physiology.",2023,Cureus,,,,"Background Artificial intelligence (AI) is evolving in the medical education system. ChatGPT, Google Bard, and Microsoft Bing are AI-based models that can solve problems in medical education. However, the applicability of AI to create reasoning-based multiple-choice questions (MCQs) in the field of medical physiology is yet to be explored. Objective We aimed to assess and compare the applicability of ChatGPT, Bard, and Bing in generating reasoning-based MCQs for MBBS (Bachelor of Medicine, Bachelor of Surgery) undergraduate students on the subject of physiology. Methods The National Medical Commission of India has developed an 11-module physiology curriculum with various competencies. Two physiologists independently chose a competency from each module. 
The third physiologist prompted all three AIs to generate five MCQs for each chosen competency. The two physiologists who provided the competencies rated the MCQs generated by the AIs on a scale of 0-3 for validity, difficulty, and reasoning ability required to answer them. We analyzed the average of the two scores using the Kruskal-Wallis test to compare the distribution across the total and module-wise responses, followed by a post-hoc test for pairwise comparisons. We used Cohen's Kappa (Kappa) to assess the agreement in scores between the two raters. We expressed the data as a median with an interquartile range. We determined their statistical significance by a p-value <0.05. Results ChatGPT and Bard generated 110 MCQs for the chosen competencies. However, Bing provided only 100 MCQs as it failed to generate them for two competencies. The validity of the MCQs was rated as 3 (3-3) for ChatGPT, 3 (1.5-3) for Bard, and 3 (1.5-3) for Bing, showing a significant difference (p<0.001) among the models. The difficulty of the MCQs was rated as 1 (0-1) for ChatGPT, 1 (1-2) for Bard, and 1 (1-2) for Bing, with a significant difference (p=0.006). The required reasoning ability to answer the MCQs was rated as 1 (1-2) for ChatGPT, 1 (1-2) for Bard, and 1 (1-2) for Bing, with no significant difference (p=0.235). Kappa was >/= 0.8 for all three parameters across all three AI models. Conclusion AI still needs to evolve to generate reasoning-based MCQs in medical physiology. ChatGPT, Bard, and Bing showed certain limitations. Bing generated the least valid MCQs, while ChatGPT generated the least difficult MCQs.",Agarwal M; Sharma P; Goswami A 37303324,Proof of Concept: Using ChatGPT to Teach Emergency Physicians How to Break Bad News.,2023,Cureus,,,,"Background Breaking bad news is an essential skill for practicing physicians, particularly in the field of emergency medicine (EM).
Patient-physician communication teaching has previously relied on standardized patient scenarios and objective structured clinical examination formats. The novel use of artificial intelligence (AI) chatbot technology, such as Chat Generative Pre-trained Transformer (ChatGPT), may provide an alternative role in graduate medical education in this area. As a proof of concept, the author demonstrates how providing detailed prompts to the AI chatbot can facilitate the design of a realistic clinical scenario, enable active roleplay, and deliver effective feedback to physician trainees. Methods ChatGPT-3.5 language model was utilized to assist in the roleplay of breaking bad news. A detailed input prompt was designed to outline rules of play and grading assessment via a standardized scale. User inputs (physician role), chatbot outputs (patient role) and ChatGPT-generated feedback were recorded. Results ChatGPT set up a realistic training scenario on breaking bad news based on the initial prompt. Active roleplay as a patient in an emergency department setting was accomplished, and clear feedback was provided to the user through the application of the Setting up, Perception, Invitation, Knowledge, Emotions with Empathy, and Strategy or Summary (SPIKES) framework for breaking bad news. Conclusion The novel use of AI chatbot technology to assist educators is abundant with potential. ChatGPT was able to design an appropriate scenario, provide a means for simulated patient-physician roleplay, and deliver real-time feedback to the physician user. Future studies are required to expand use to a targeted group of EM physician trainees and provide best practice guidelines for AI use in graduate medical education.",Webb JJ 37073184,The Capability of ChatGPT in Predicting and Explaining Common Drug-Drug Interactions.,2023,Cureus,,,,"Background Drug-drug interactions (DDIs) can have serious consequences for patient health and well-being. 
Patients who are taking multiple medications may be at an increased risk of experiencing adverse events or drug toxicity if they are not aware of potential interactions between their medications. Many times, patients self-prescribe medications without knowing about DDIs. Objective The objective is to investigate the effectiveness of ChatGPT, a large language model, in predicting and explaining common DDIs. Methods A list of 40 DDIs was prepared from previously published literature. This list was used to converse with ChatGPT with a two-stage question. The first question was asked as ""can I take X and Y together?"" with two drug names. After storing the output, the next question was asked. The second question was asked as ""why should I not take X and Y together?"" The output was stored for further analysis. The responses were checked by two pharmacologists and the consensus output was categorized as ""correct"" and ""incorrect."" The ""correct"" ones were further classified as ""conclusive"" and ""inconclusive."" The text was checked for reading ease scores and grades of education required to understand the text. Data were tested by descriptive and inferential statistics. Results Among the 40 DDI pairs, one answer was incorrect in the first question. Among correct answers, 19 were conclusive and 20 were inconclusive. For the second question, one answer was wrong. Among correct answers, 17 were conclusive and 22 were inconclusive. The mean Flesch reading ease score was 27.64+/-10.85 in answers to the first question and 29.35+/-10.16 in answers to the second question, p = 0.47. The mean Flesch-Kincaid grade level was 15.06+/-2.79 in answers to the first question and 14.85+/-1.97 in answers to the second question, p = 0.69. When we compared the reading levels with a hypothetical 6th-grade level, the grades were significantly higher than expected (t = 20.57, p < 0.0001 for first answers and t = 28.43, p < 0.0001 for second answers).
Conclusion ChatGPT is a partially effective tool for predicting and explaining DDIs. Patients, who may not have immediate access to the healthcare facility for getting information about DDIs, may take help from ChatGPT. However, on several occasions, it may provide incomplete guidance. Further improvement is required for potential usage by patients for getting ideas about DDI.",Juhi A; Pipil N; Santra S; Mondal S; Behera JK; Mondal H 39845219,Evaluating the Accuracy of ChatGPT in the Japanese Board-Certified Physiatrist Examination.,2024,Cureus,,,,"Background Generative artificial intelligence (AI), such as Chat Generative Pre-trained Transformer (ChatGPT), has shown potential in various medical applications, including answering licensing examination questions. However, its performance in rehabilitation medicine remains underexplored. This study aimed to evaluate the accuracy of ChatGPT4o in answering questions from the Japanese Board-Certified Physiatrist Examination and assess its potential as an educational and clinical support tool. Methods This study assessed the performance of ChatGPT4o on questions from the 2021-2023 Japanese Board-Certified Physiatrist Examinations. Questions were categorized into text- and image-based types and correct response rates were calculated. Errors were classified into informational, logical, or statistical. Results ChatGPT4o achieved correct response rates of 79.1% in 2021, 80.0% in 2022, and 86.3% in 2023, with an overall accuracy of 81.8%. The AI performed better on text-based (83.0%) than on image-based (70.0%) questions. Most errors (92.8%) were related to information. Conclusions ChatGPT4o demonstrated high accuracy in the Japanese Board-Certified Physiatrist Examination, particularly for text-based questions, demonstrating its potential as an educational tool. 
However, limitations in image interpretation and specialized topics indicate the need for further improvements for clinical application.",Kato Y; Ushida K; Momosaki R 37143631,Evaluating ChatGPT's Ability to Solve Higher-Order Questions on the Competency-Based Medical Education Curriculum in Medical Biochemistry.,2023,Cureus,,,,"Background Healthcare-related artificial intelligence (AI) is developing. The capacity of the system to carry out sophisticated cognitive processes, such as problem-solving, decision-making, reasoning, and perceiving, is referred to as higher cognitive thinking in AI. This kind of thinking requires more than just processing facts; it also entails comprehending and working with abstract ideas, evaluating and applying data relevant to the context, and producing new insights based on prior learning and experience. ChatGPT is an artificial intelligence-based conversational software that can engage with people to answer questions and uses natural language processing models. The platform has created a worldwide buzz and keeps setting an ongoing trend in solving many complex problems in various dimensions. Nevertheless, ChatGPT's capacity to correctly respond to queries requiring higher-level thinking in medical biochemistry has not yet been investigated. So, this research aimed to evaluate ChatGPT's aptitude for responding to higher-order questions on medical biochemistry. Objective In this study, our objective was to determine whether ChatGPT can address higher-order problems related to medical biochemistry. Methods This cross-sectional study was done online by conversing with the current version of ChatGPT (14 March 2023, which is presently free for registered users). It was presented with 200 medical biochemistry reasoning questions that require higher-order thinking.
These questions were randomly picked from the institution's question bank and classified according to the Competency-Based Medical Education (CBME) curriculum's competency modules. The responses were collected and archived for subsequent research. Two expert biochemistry academicians examined the replies on a zero to five scale. The score's accuracy was determined by a one-sample Wilcoxon signed rank test using hypothetical values. Results The AI software answered 200 questions requiring higher-order thinking with a median score of 4.0 (Q1=3.50, Q3=4.50). Using a single sample Wilcoxon signed rank test, the result was less than the hypothetical maximum of five (p=0.001) and comparable to four (p=0.16). There was no difference in the replies to questions from different CBME modules in medical biochemistry (Kruskal-Wallis p=0.39). The inter-rater reliability of the scores given by the two biochemistry faculty members was outstanding (ICC=0.926 (95% CI: 0.814-0.971); F=19; p=0.001). Conclusion The results of this research indicate that ChatGPT has the potential to be a successful tool for answering questions requiring higher-order thinking in medical biochemistry, with a median score of four out of five. However, continuous training and development with data of recent advances are essential to improve performance and make it functional for the ever-growing field of academic medical usage.",Ghosh A; Bir A 38111405,"ChatGPT-4 and Human Researchers Are Equal in Writing Scientific Introduction Sections: A Blinded, Randomized, Non-inferiority Controlled Study.",2023,Cureus,,,,"Background Natural language processing models are increasingly used in scientific research, and their ability to perform various tasks in the research process is rapidly advancing. This study aims to investigate whether Generative Pre-trained Transformer 4 (GPT-4) is equal to humans in writing introduction sections for scientific articles.
Methods This randomized non-inferiority study was reported according to the Consolidated Standards of Reporting Trials for non-inferiority trials and artificial intelligence (AI) guidelines. GPT-4 was instructed to synthesize 18 introduction sections based on the aim of previously published studies, and these sections were compared to the human-written introductions already published in a medical journal. Eight blinded assessors randomly evaluated the introduction sections using 1-10 Likert scales. Results There was no significant difference between GPT-4 and human introductions regarding publishability and content quality. GPT-4 had one point significantly better scores in readability, which was considered a non-relevant difference. The majority of assessors (59%) preferred GPT-4, while 33% preferred human-written introductions. Based on Lix and Flesch-Kincaid scores, GPT-4 introductions were 10 and two points higher, respectively, indicating that the sentences were longer and had longer words. Conclusion GPT-4 was found to be equal to humans in writing introductions regarding publishability, readability, and content quality. The majority of assessors preferred GPT-4 introductions and less than half could determine which were written by GPT-4 or humans. These findings suggest that GPT-4 can be a useful tool for writing introduction sections, and further studies should evaluate its ability to write other parts of scientific articles.",Sikander B; Baker JJ; Deveci CD; Lund L; Rosenberg J 38238871,AI in the repurposing of potential herbs for filariasis therapy.,2024,Journal of vector borne diseases,,,,"BACKGROUND OBJECTIVES: The goal of this study was to see how well an AI language model called Chat Generative Pre-trained Transformer (ChatGPT) assisted healthcare personnel in selecting relevant medications for filariasis therapy. 
METHODS: Ten hypothetical filariasis clinical cases were submitted to ChatGPT, and its recommendations were evaluated by a panel of medical professionals and tropical medicine experts. RESULTS: ChatGPT gave appropriate suggestions for potential medication repurposing in filariasis treatment in all ten clinical scenarios. Its drug recommendations were in line with current medical research and literature. Despite the lack of particular treatment regimens, ChatGPT's general ideas proved useful for healthcare practitioners, providing insights and updates on prospective drug repurposing tactics. INTERPRETATION CONCLUSION: ChatGPT shows promise as a useful method for repurposing drugs in the treatment of filariasis. Its thorough and brief responses make it useful for finding possible pharmacological candidates. However, it is critical to recognize ChatGPT's limitations, such as the requirement for additional clinical information and the inability to change therapy. Further research and development are required to optimize its use in filariasis therapy settings.",Wiwanitmkit S; Wiwanitkit V 39404623,Testing the Ability and Limitations of ChatGPT to Generate Differential Diagnoses from Transcribed Radiologic Findings.,2024,Radiology,,,,"Background The burgeoning interest in ChatGPT as a potentially useful tool in medicine highlights the necessity for systematic evaluation of its capabilities and limitations. Purpose To evaluate the accuracy, reliability, and repeatability of differential diagnoses produced by ChatGPT from transcribed radiologic findings.
Materials and Methods Cases selected from a radiology textbook series spanning a variety of imaging modalities, subspecialties, and anatomic pathologies were converted into standardized prompts that were entered into ChatGPT (GPT-3.5 and GPT-4 algorithms; April 3 to June 1, 2023). Responses were analyzed for accuracy via comparison with the final diagnosis and top 3 differential diagnosis provided in the textbook, which served as the ground truth. Reliability, defined based on the frequency of algorithmic hallucination, was assessed through the identification of factually incorrect statements and fabricated references. Comparisons were made between the algorithms using the McNemar test and a generalized estimating equation model framework. Test-retest repeatability was measured by obtaining 10 independent responses from both algorithms for 10 cases in each subspecialty, and calculating the average pairwise percent agreement and Krippendorff alpha. Results A total of 339 cases were collected across multiple radiologic subspecialties. The overall accuracy of GPT-3.5 and GPT-4 for final diagnosis was 53.7% (182 of 339) and 66.1% (224 of 339; P < .001), respectively. The mean differential score (ie, proportion of top 3 diagnoses that matched the original literature differential diagnosis) for GPT-3.5 and GPT-4 was 0.50 and 0.54 (P = .06), respectively. Of the references provided in GPT-3.5 and GPT-4 responses, 39.9% (401 of 1006) and 14.3% (161 of 1124; P < .001), respectively, were fabricated. GPT-3.5 and GPT-4 generated false statements in 16.2% (55 of 339) and 4.7% (16 of 339; P < .001) of cases, respectively. The range of average pairwise percent agreement across subspecialties for the final diagnosis and top 3 differential diagnosis was 59%-98% and 23%-49%, respectively. Conclusion ChatGPT achieved the best results when the most up-to-date model (GPT-4) was used and when it was prompted for a single diagnosis. 
Hallucination frequency was lower with GPT-4 than with GPT-3.5, but repeatability was an issue for both models. (c) RSNA, 2024 Supplemental material is available for this article. See also the editorial by Chang in this issue.",Sun SH; Huynh K; Cortes G; Hill R; Tran J; Yeh L; Ngo AL; Houshyar R; Yaghmai V; Tran M 40416223,Performance of GPT-4o and DeepSeek-R1 in the Polish Infectious Diseases Specialty Exam.,2025,Cureus,,,,"Background The past few years have been a time of rapid development in artificial intelligence (AI) and its implementation across numerous fields. This study aimed to compare the performance of GPT-4o (OpenAI, San Francisco, CA, USA) and DeepSeek-R1 (DeepSeek AI, Zhejiang, China) on the Polish specialty examination in infectious diseases. Materials and methods The study was conducted from April 1 to April 4, 2025, using the Autumn 2024 Polish specialty examination in infectious diseases. The examination comprised 120 questions, each presenting five answer options, with only one correct choice. The Center for Medical Education (CEM) in Lodz, Poland, decided to withdraw one question due to the absence of a definitive correct answer and inconsistency with up-to-date clinical guidelines. Furthermore, the questions were classified as either 'clinical cases' or 'other' to enable a more in-depth evaluation of the potential of artificial intelligence in real-world clinical practice. The accuracy of the responses was verified using the official answer key approved by the CEM. To assess the accuracy and confidence level of the responses provided by GPT-4o and DeepSeek-R1, statistical methods were employed, including Pearson's chi(2) test and the Mann-Whitney U test. Results GPT-4o correctly answered 85 out of 119 questions (71.43%) while DeepSeek-R1 correctly answered 88 out of 119 questions (73.95%). A minimum of 72 (60.5%) correct responses is required to pass the examination.
No statistically significant difference was observed between responses to 'clinical case' questions and 'other' questions for either AI model. For both AI models, a statistically significant difference was observed in the confidence levels between correct and incorrect answers, with higher confidence reported for correctly answered questions and lower confidence for incorrectly answered ones. Conclusions Both GPT-4o and DeepSeek-R1 demonstrated the ability to pass the Polish specialty examination in infectious diseases, suggesting their potential as educational tools. Additionally, it is noteworthy that DeepSeek-R1 achieved a performance comparable to GPT-4o, despite being a much newer model on the market and, according to available data, having been developed at significantly lower cost.",Blecha Z; Jasinski D; Jaworski A; Latkowska A; Jaworski W; Syslo O; Rubik N; Jastrzebska I; Harazinski K; Goliat W; Gmur M; Gajewski M; Slawinska B; Maryniak N 39156244,Performance of ChatGPT in Solving Questions From the Progress Test (Brazilian National Medical Exam): A Potential Artificial Intelligence Tool in Medical Practice.,2024,Cureus,,,,"Background The use of artificial intelligence (AI) is not a recent phenomenon, but the latest advancements in this technology are making a significant impact across various fields of human knowledge. In medicine, this trend is no different, although it has developed at a slower pace. ChatGPT is an example of an AI-based algorithm capable of answering questions, interpreting phrases, and synthesizing complex information, potentially aiding and even replacing humans in various areas of social interest. Some studies have compared its performance in solving medical knowledge exams with medical students and professionals to verify AI accuracy. This study aimed to measure the performance of ChatGPT in answering questions from the Progress Test from 2021 to 2023. 
Methodology An observational study was conducted in which questions from the 2021 Progress Test and the regional tests (Southern Institutional Pedagogical Support Center II) of 2022 and 2023 were presented to ChatGPT 3.5. The results obtained were compared with the scores of first- to sixth-year medical students from over 120 Brazilian universities. All questions were presented sequentially, without any modification to their structure. After each question was presented, the platform's history was cleared, and the site was restarted. Results The platform achieved an average accuracy rate in 2021, 2022, and 2023 of 69.7%, 68.3%, and 67.2%, respectively, surpassing students from all medical years in the three tests evaluated, reinforcing findings in the current literature. The subject with the best score for the AI was Public Health, with a mean grade of 77.8%. Conclusions ChatGPT demonstrated the ability to answer medical questions with higher accuracy than humans, including students from the last year of medical school.",Rodrigues Alessi M; Gomes HA; Lopes de Castro M; Terumy Okamoto C 39364514,Evaluating the Adherence of Large Language Models to Surgical Guidelines: A Comparative Analysis of Chatbot Recommendations and North American Spine Society (NASS) Coverage Criteria.,2024,Cureus,,,,"Background There has been a significant increase in cervical fusion procedures, both anterior and posterior, across the United States. Despite this upward trend, limited research exists on adherence to evidence-based medicine (EBM) guidelines for cervical fusion, highlighting a gap between recommended practices and surgeon preferences. Additionally, patients are increasingly utilizing large language models (LLMs) to aid in decision-making. Methodology This observational study evaluated the capacity of four LLMs, namely, Bard, BingAI, ChatGPT-3.5, and ChatGPT-4, to adhere to EBM guidelines, specifically the 2023 North American Spine Society (NASS) cervical fusion guidelines. 
Ten clinical vignettes were created based on NASS recommendations to determine when fusion was indicated. This novel approach assessed LLM performance in a clinical decision-making context without requiring institutional review board approval, as no human subjects were involved. Results No LLM achieved complete concordance with NASS guidelines, though ChatGPT-4 and Bing Chat exhibited the highest adherence at 60%. Discrepancies were notably observed in scenarios involving head-drop syndrome and pseudoarthrosis, where all LLMs failed to align with NASS recommendations. Additionally, only 25% of LLMs agreed with NASS guidelines for fusion in cases of cervical radiculopathy and as an adjunct to facet cyst resection. Conclusions The study underscores the need for improved LLM training on clinical guidelines and emphasizes the importance of considering the nuances of individual patient cases. While LLMs hold promise for enhancing guideline adherence in cervical fusion decision-making, their current performance indicates a need for further refinement and integration with clinical expertise to ensure optimal patient care. This study contributes to understanding the role of AI in healthcare, advocating for a balanced approach that leverages technological advancements while acknowledging the complexities of surgical decision-making.",Sarikonda A; Isch E; Self M; Sambangi A; Carreras A; Sivaganesan A; Harrop J; Jallo J 38953081,Using ChatGPT in the Development of Clinical Reasoning Cases: A Qualitative Study.,2024,Cureus,,,,"Background There has been an explosion of commentary and discussion about the ethics and utility of using artificial intelligence in medicine, and its practical use in medical education is still being debated. Through qualitative research methods, this study aims to highlight the advantages and pitfalls of using ChatGPT in the development of clinical reasoning cases for medical student education. 
Methods Five highly experienced faculty in medical education were provided instructions to create unique clinical reasoning cases for three different chief concerns using ChatGPT 3.0. Faculty were then asked to reflect on and review the created cases. Finally, a focus group was conducted to further analyze and describe their experiences with the new technology. Results Overall, faculty found the use of ChatGPT in the development of clinical reasoning cases easy to use but difficult to get to certain objectives and largely incapable of being creative enough to create complexity for student use without heavy editing. The created cases did provide a helpful starting point and were extremely efficient; however, faculty did experience some medical inaccuracies and fact fabrication. Conclusion There is value to using ChatGPT to develop curricular content, especially for clinical reasoning cases, but it needs to be comprehensively reviewed and verified. To efficiently and effectively utilize the tool, educators will need to develop a framework that can be easily translatable into simple prompts that ChatGPT can understand. Future work will need to strongly consider the risks of recirculating biases and misinformation.",Wong K; Fayngersh A; Traba C; Cennimo D; Kothari N; Chen S 37303347,Exploring ChatGPT's Potential in Facilitating Adaptation of Clinical Guidelines: A Case Study of Diabetic Ketoacidosis Guidelines.,2023,Cureus,,,,"Background This study aimed to evaluate the efficacy of ChatGPT, an advanced natural language processing model, in adapting and synthesizing clinical guidelines for diabetic ketoacidosis (DKA) by comparing and contrasting different guideline sources. 
Methodology We employed a comprehensive comparison approach and examined three reputable guideline sources: Diabetes Canada Clinical Practice Guidelines Expert Committee (2018), Emergency Management of Hyperglycaemia in Primary Care, and Joint British Diabetes Societies (JBDS) 02 The Management of Diabetic Ketoacidosis in Adults. Data extraction focused on diagnostic criteria, risk factors, signs and symptoms, investigations, and treatment recommendations. We compared the synthesized guidelines generated by ChatGPT and identified any misreporting or non-reporting errors. Results ChatGPT was capable of generating a comprehensive table comparing the guidelines. However, multiple recurrent errors, including misreporting and non-reporting errors, were identified, rendering the results unreliable. Additionally, inconsistencies were observed in the repeated reporting of data. The study highlights the limitations of using ChatGPT for the adaptation of clinical guidelines without expert human intervention. Conclusions Although ChatGPT demonstrates the potential for the synthesis of clinical guidelines, the presence of multiple recurrent errors and inconsistencies underscores the need for expert human intervention and validation. Future research should focus on improving the accuracy and reliability of ChatGPT, as well as exploring its potential applications in other areas of clinical practice and guideline development.",Hamed E; Eid A; Alberry M 39371744,GPT-4o vs. Human Candidates: Performance Analysis in the Polish Final Dentistry Examination.,2024,Cureus,,,,"Background This study aims to evaluate the performance of OpenAI's GPT-4o in the Polish Final Dentistry Examination (LDEK) and compare it with human candidates' results. The LDEK is a standardized test essential for dental graduates in Poland to obtain their professional license. 
With artificial intelligence (AI) becoming increasingly integrated into medical and dental education, it is important to assess AI's capabilities in such high-stakes examinations. Materials and methods The study was conducted from August 1 to August 15, 2024, using the Spring 2023 LDEK exam. The exam comprised 200 multiple-choice questions, each with one correct answer among five options. Questions spanned various dental disciplines, including Conservative Dentistry with Endodontics, Pediatric Dentistry, Dental Surgery, Prosthetic Dentistry, Periodontology, Orthodontics, Emergency Medicine, Bioethics and Medical Law, Medical Certification, and Public Health. The exam organizers withdrew one question. GPT-4o was tested on these questions without access to the publicly available question bank. The AI model's responses were recorded, and each answer's confidence level was assessed. Correct answers were determined based on the official key provided by the Center for Medical Education (CEM) in Lodz, Poland. Statistical analyses, including Pearson's chi-square test and the Mann-Whitney U test, were performed to evaluate the accuracy and confidence of ChatGPT's answers across different dental fields. Results GPT-4o correctly answered 141 out of 199 valid questions (70.85%) and incorrectly answered 58 (29.15%). The AI performed better in fields like Conservative Dentistry with Endodontics (71.74%) and Prosthetic Dentistry (80%) but showed lower accuracy in Pediatric Dentistry (62.07%) and Orthodontics (52.63%). A statistically significant difference was observed between ChatGPT's performance on clinical case-based questions (36.36% accuracy) and other factual questions (72.87% accuracy), with a p-value of 0.025. Confidence levels also varied significantly between correct and incorrect answers, with a p-value of 0.0208. Conclusions GPT-4o's performance in the LDEK suggests it has potential as a supplementary educational tool in dentistry. 
However, the AI's limited clinical reasoning abilities, especially in complex scenarios, reveal a substantial gap between AI and human expertise. While ChatGPT demonstrates strong performance in factual recall, it cannot yet match the critical thinking and clinical judgment exhibited by human candidates.",Jaworski A; Jasinski D; Slawinska B; Blecha Z; Jaworski W; Kruplewicz M; Jasinska N; Syslo O; Latkowska A; Jung M 38618446,Performance of ChatGPT vs. HuggingChat on OB-GYN Topics.,2024,Cureus,,,,"Background While large language models show potential as beneficial tools in medicine, their reliability, especially in the realm of obstetrics and gynecology (OB-GYN), is not fully comprehended. This study seeks to measure and contrast the performance of ChatGPT and HuggingChat in addressing OB-GYN-related medical examination questions, offering insights into their effectiveness in this specialized field. Methods ChatGPT and HuggingChat were subjected to two standardized multiple-choice question banks: Test 1, developed by the National Board of Medical Examiners (NBME), and Test 2, gathered from the Association of Professors of Gynecology & Obstetrics (APGO) Web-Based Interactive Self-Evaluation (uWISE). Responses were analyzed and compared for correctness. Results The two-proportion z-test revealed no statistically significant difference in performance between ChatGPT and HuggingChat on both medical examinations. For Test 1, ChatGPT scored 90%, while HuggingChat scored 85% (p = 0.6). For Test 2, ChatGPT correctly answered 70% of questions, while HuggingChat correctly answered 62% of questions (p = 0.4). Conclusion Awareness of the strengths and weaknesses of artificial intelligence allows for the proper and effective use of its knowledge. Our findings indicate that there is no statistically significant difference in performance between ChatGPT and HuggingChat in addressing medical inquiries. 
Nonetheless, both platforms demonstrate considerable promise for applications within the medical domain.",Kirshteyn G; Golan R; Chaet M 39421095,Comparison of Gemini Advanced and ChatGPT 4.0's Performances on the Ophthalmology Resident Ophthalmic Knowledge Assessment Program (OKAP) Examination Review Question Banks.,2024,Cureus,,,,"Background With advancements in natural language processing, tools such as Chat Generative Pre-Trained Transformers (ChatGPT) version 4.0 and Google Bard's Gemini Advanced are being increasingly evaluated for their potential in various medical applications. The objective of this study was to systematically assess the performance of these language learning models (LLMs) on both image and non-image-based questions within the specialized field of Ophthalmology. We used a review question bank for the Ophthalmic Knowledge Assessment Program (OKAP) used by ophthalmology residents nationally to prepare for the Ophthalmology Board Exam to assess the accuracy and performance of ChatGPT and Gemini Advanced. Methodology A total of 260 randomly generated multiple-choice questions from the OphthoQuestions question bank were run through ChatGPT and Gemini Advanced. A simulated 260-question OKAP examination was created at random from the bank. Question-specific data were analyzed, including overall percent correct, subspecialty accuracy, whether the question was ""high yield,"" difficulty (1-4), and question type (e.g., image, text). To compare the performance of ChatGPT-4 and Gemini on the difficulty of questions, we utilized the standard deviation of user answer choices to determine question difficulty. In this study, a statistical analysis of Google Sheets was conducted using two-tailed t-tests with unequal variance to compare the performance of ChatGPT-4.0 and Google's Gemini Advanced across various question types, subspecialties, and difficulty levels. 
Results In total, 259 of the 260 questions were included in the study, as one question used a video that no version of ChatGPT could interpret as of May 1, 2024. For text-only questions, ChatGPT-4.0 correctly answered 57.14% (148/259, p < 0.018), and Gemini Advanced correctly answered 46.72% (121/259, p < 0.018). Both versions answered most questions without a prompt and would have received a below-average score on the OKAP. Moreover, there were 27 questions utilizing a secondary prompt in ChatGPT-4.0 compared to 67 questions in Gemini Advanced. ChatGPT-4.0 scored 68.99% on easier questions (<2 on a scale from 1-4) and 44.96% on harder questions (>2 on a scale from 1-4). On the other hand, Gemini Advanced scored 49.61% on easier questions (<2 on a scale from 1-4) and 44.19% on harder questions (>2 on a scale from 1-4). There was a statistically significant difference in accuracy between ChatGPT-4.0 and Gemini Advanced for easy (p < 0.0015) but not for hard (p < 0.55) questions. For image-only questions, ChatGPT-4.0 correctly answered 39.58% (19/48, p < 0.013), and Gemini Advanced correctly answered 33.33% (16/48, p < 0.022), resulting in a statistically insignificant difference in accuracy between ChatGPT-4.0 and Gemini Advanced (p < 0.530). A comparison against text-only and image-based questions demonstrated a statistically significant difference in accuracy for both ChatGPT-4.0 (p < 0.013) and Gemini Advanced (p < 0.022). Conclusions This study provides evidence that ChatGPT-4.0 performs better on OKAP-style exams than Gemini Advanced within the context of ophthalmic multiple-choice questions. This may indicate a greater role for ChatGPT in ophthalmic medical education.
While showing promise within medical education, caution should be used as a more detailed evaluation of reliability is needed.",Gill GS; Tsai J; Moxam J; Sanghvi HA; Gupta S 39878409,Use of ChatGPT Large Language Models to Extract Details of Recommendations for Additional Imaging From Free-Text Impressions of Radiology Reports.,2025,AJR. American journal of roentgenology,,,,"BACKGROUND. Automated extraction of actionable details of recommendations for additional imaging (RAIs) from radiology reports could facilitate tracking and timely completion of clinically necessary RAIs and thereby potentially reduce diagnostic delays. OBJECTIVE. The purpose of the study was to assess the performance of large language models (LLMs) in extracting actionable details of RAIs from radiology reports. METHODS. This retrospective single-center study evaluated reports of diagnostic radiology examinations performed across modalities and care settings within five subspecialties (abdominal imaging, musculoskeletal imaging, neuroradiology, nuclear medicine, thoracic imaging) in August 2023. Of reports identified by a previously validated natural language processing algorithm to contain an RAI, 250 were randomly selected; 231 of these reports were confirmed to contain an RAI on manual review and formed the study sample. Twenty-five reports were used to engineer a prompt instructing an LLM, when inputted in a report impression containing an RAI, to extract details about the modality, body part, time frame, and rationale of the RAI; the remaining 206 reports were used for testing the prompt in combination with GPT-3.5 and GPT-4. A 4th-year medical student and radiologist from the relevant subspecialty independently classified the LLM outputs as correct versus incorrect for extracting the four actionable details of RAIs in comparison with the report impressions; a third reviewer assisted in resolving discrepancies. 
Extraction accuracy was summarized and compared between LLMs using consensus assessments. RESULTS. For GPT-3.5 and GPT-4, the two reviewers agreed about classification of LLM output as correct versus incorrect with respect to report impressions for 95.6% and 94.2% for RAI modality, 89.3% and 88.3% for RAI body part, 96.1% and 95.1% for RAI time frame, and 89.8% and 88.8% for RAI rationale, respectively. GPT-4 was more accurate than GPT-3.5 in extracting RAI modality (94.2% [194/206] vs 85.4% [176/206], p < .001), RAI body part (86.9% [179/206] vs 77.2% [159/206], p = .004), and RAI time frame (99.0% [204/206] vs 95.6% [197/206], p = .02). Both LLMs had accuracy of 91.7% (189/206) for extracting RAI rationale. CONCLUSION. LLMs were used to extract actionable details of RAIs from free-text impression sections of radiology reports; GPT-4 outperformed GPT-3.5. CLINICAL IMPACT. The technique could represent an innovative method to facilitate timely completion of clinically necessary radiologist recommendations.",Li KW; Lacson R; Guenette JP; DiPiro PJ; Burk KS; Kapoor N; Salah F; Khorasani R 36946005,Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma.,2023,Clinical and molecular hepatology,,,,"BACKGROUND/AIMS: Patients with cirrhosis and hepatocellular carcinoma (HCC) require extensive and personalized care to improve outcomes. ChatGPT (Generative Pre-trained Transformer), a large language model, holds the potential to provide professional yet patient-friendly support. We aimed to examine the accuracy and reproducibility of ChatGPT in answering questions regarding knowledge, management, and emotional support for cirrhosis and HCC. METHODS: ChatGPT's responses to 164 questions were independently graded by two transplant hepatologists and resolved by a third reviewer. 
The performance of ChatGPT was also assessed using two published questionnaires and 26 questions formulated from the quality measures of cirrhosis management. Finally, its emotional support capacity was tested. RESULTS: We showed that ChatGPT regurgitated extensive knowledge of cirrhosis (79.1% correct) and HCC (74.0% correct), but only small proportions (47.3% in cirrhosis, 41.1% in HCC) were labeled as comprehensive. The performance was better in basic knowledge, lifestyle, and treatment than in the domains of diagnosis and preventive medicine. For the quality measures, the model answered 76.9% of questions correctly but failed to specify decision-making cut-offs and treatment durations. ChatGPT lacked knowledge of regional guideline variations, such as HCC screening criteria. However, it provided practical and multifaceted advice to patients and caregivers regarding the next steps and adjusting to a new diagnosis. CONCLUSION: We analyzed the areas of robustness and limitations of ChatGPT's responses on the management of cirrhosis and HCC and relevant emotional support. ChatGPT may have a role as an adjunct informational tool for patients and physicians to improve outcomes.",Yeo YH; Samaan JS; Ng WH; Ting PS; Trivedi H; Vipani A; Ayoub W; Yang JD; Liran O; Spiegel B; Kuo A 38509182,"ChatGPT-3.5 and Bing Chat in ophthalmology: an updated evaluation of performance, readability, and informative sources.",2024,"Eye (London, England)",,,,"BACKGROUND/OBJECTIVES: Experimental investigation. Bing Chat (Microsoft) integration with ChatGPT-4 (OpenAI) has conferred the capability of accessing online data past 2021. We investigate its performance against ChatGPT-3.5 on a multiple-choice question ophthalmology exam. SUBJECTS/METHODS: In August 2023, ChatGPT-3.5 and Bing Chat were evaluated against 913 questions derived from the Academy's Basic and Clinical Science Course (BCSC) collection.
For each response, the sub-topic, performance, Simple Measure of Gobbledygook readability score (measuring years of required education to understand a given passage), and cited resources were collected. The primary outcomes were the comparative scores between models, and qualitatively, the resources referenced by Bing Chat. Secondary outcomes included performance stratified by response readability, question type (explicit or situational), and BCSC sub-topic. RESULTS: Across 913 questions, ChatGPT-3.5 scored 59.69% [95% CI 56.45,62.94] while Bing Chat scored 73.60% [95% CI 70.69,76.52]. Both models performed significantly better on explicit than on clinical reasoning questions. Both models performed better on general medicine questions than on ophthalmology subsections. Bing Chat referenced 927 online entities and provided at least one citation for 836 of the 913 questions. The use of more reliable (peer-reviewed) sources was associated with a higher likelihood of a correct response. The most-cited resources were eyewiki.aao.org, aao.org, wikipedia.org, and ncbi.nlm.nih.gov. Bing Chat showed significantly better readability than ChatGPT-3.5, averaging a reading level of grade 11.4 [95% CI 7.14, 15.7] versus 12.4 [95% CI 8.77, 16.1], respectively (p-value < 0.0001, rho = 0.25). CONCLUSIONS: The online access, improved readability, and citation feature of Bing Chat confer additional utility for ophthalmology learners. We recommend critical appraisal of cited sources during response interpretation.",Tao BK; Hua N; Milkovich J; Micieli JA 40218132,Can ChatGPT Help General Practitioners Become Acquainted with Conversations About Dying? A Simulated Single-Case Study.,2025,"Healthcare (Basel, Switzerland)",,,,"Background/Objectives: General practitioners (GPs) should be able to initiate open conversations about death with their patients.
It is hypothesized that a change in attitude regarding discussions of death with patients may be accomplished through doctors' training, particularly with the use of artificial intelligence (AI). This study aimed to evaluate whether OpenAI's ChatGPT can simulate a medical communication scenario involving a GP consulting a patient who is dying at home. Methods: ChatGPT-4o was prompted to generate a medical communication scenario in which a GP consults with a patient dying at home. ChatGPT-4o was instructed to follow seven predefined steps from an evidence-based model for discussing dying with patients and their family caregivers. The output was assessed by comparing each step of the conversation to the model's recommendations. Results: ChatGPT-4o created a seven-step scenario based on the initial prompt and addressed almost all intended recommendations. However, two points were not addressed: ChatGPT-4o did not use terms like ""dying"", ""passing away"", or ""death"", although the concept was present from the beginning of the conversation with the patient. Additionally, cultural and religious backgrounds related to dying and death were not discussed. Conclusions: ChatGPT-4o can be used as a supportive tool for introducing GPs to the language and sequencing of speech acts that form a successful foundation for meaningful, sensitive conversations about dying, without requiring advanced technical resources and without placing any burden on real patients.",Prazeres F 40004661,Language Artificial Intelligence Models as Pioneers in Diagnostic Medicine? A Retrospective Analysis on Real-Time Patients.,2025,Journal of clinical medicine,,,,"Background/Objectives: GPT-3.5 and GPT-4 have shown promise in assisting healthcare professionals with clinical questions. However, their performance in real-time clinical scenarios remains underexplored.
This study aims to evaluate their precision and reliability compared to board-certified emergency department attendings, highlighting their potential in improving patient care. We hypothesized that board-certified emergency department attendings at Maimonides Medical Center exhibit higher accuracy and reliability than GPT-3.5 and GPT-4 in generating differentials based on history and physical examination for patients presenting to the emergency department. Methods: Real-time patient data from Maimonides Medical Center's emergency department, collected from 1 January 2023 to 1 March 2023 were analyzed. Demographic details, symptoms, medical history, and discharge diagnoses recorded by emergency room attendings were examined. AI algorithms (ChatGPT-3.5 and GPT-4) generated differential diagnoses, which were compared with those by attending physicians. Accuracy was determined by comparing each rater's diagnoses with the gold standard discharge diagnosis, calculating the proportion of correctly identified cases. Precision was assessed using Cohen's kappa coefficient and Intraclass Correlation Coefficient to measure agreement between raters. Results: Mean age of patients was 49.12 years, with 57.3% males and 42.7% females. Chief complaints included fever/sepsis (24.7%), gastrointestinal issues (17.7%), and cardiovascular problems (16.4%). Diagnostic accuracy against discharge diagnoses was highest for ChatGPT-4 (85.5%), followed by ChatGPT-3.5 (84.6%) and ED attendings (83%). Cohen's kappa demonstrated moderate agreement (0.7) between AI models, with lower agreement observed for ED attendings. Stratified analysis revealed higher accuracy for gastrointestinal complaints with Chat GPT-4 (87.5%) and cardiovascular complaints with Chat GPT-3.5 (81.34%). 
Conclusions: Our study demonstrates that ChatGPT-4 and GPT-3.5 exhibit comparable diagnostic accuracy to board-certified emergency department attendings, highlighting their potential to aid decision-making in dynamic clinical settings. The stratified analysis revealed comparable reliability and precision of the AI chatbots for cardiovascular complaints, which represent a significant proportion of the high-risk patients presenting to the emergency department, and provided targeted insights into rater performance within specific medical domains. This study contributes to integrating AI models into medical practice, enhancing efficiency and effectiveness in clinical decision-making. Further research is warranted to explore broader applications of AI in healthcare.",Naeem A; Khan O; Baqir SM; Jana K; Shankar P; Kaur A; Zaaya M; Sajid F; Mohsin F; Boadla MR; Oo A; Wong V; Noor M; Sandhu SPS; Slobodyanuk K; Shetty V; Tokayer AZ 39201195,Assessment Study of ChatGPT-3.5's Performance on the Final Polish Medical Examination: Accuracy in Answering 980 Questions.,2024,"Healthcare (Basel, Switzerland)",,,,"BACKGROUND/OBJECTIVES: The use of artificial intelligence (AI) in education is dynamically growing, and models such as ChatGPT show potential in enhancing medical education. In Poland, to obtain a medical diploma, candidates must pass the Medical Final Examination, which consists of 200 questions with one correct answer per question, is administered in Polish, and assesses students' comprehensive medical knowledge and readiness for clinical practice. The aim of this study was to determine how ChatGPT-3.5 handles questions included in this exam. METHODS: This study considered 980 questions from five examination sessions of the Medical Final Examination conducted by the Medical Examination Center in the years 2022-2024. The analysis included the field of medicine, the difficulty index of the questions, and their type, namely theoretical versus case-study questions.
RESULTS: The average correct answer rate achieved by ChatGPT for the five examination sessions hovered around 60% and was lower (p < 0.001) than the average score achieved by the examinees. The lowest percentage of correct answers was in hematology (42.1%), while the highest was in endocrinology (78.6%). The difficulty index of the questions showed a statistically significant correlation with the correctness of the answers (p = 0.04). Questions for which ChatGPT-3.5 provided incorrect answers had a lower (p < 0.001) percentage of correct responses. The type of questions analyzed did not significantly affect the correctness of the answers (p = 0.46). CONCLUSIONS: This study indicates that ChatGPT-3.5 can be an effective tool for assisting in passing the final medical exam, but the results should be interpreted cautiously. It is recommended to further verify the correctness of the answers using various AI tools.",Siebielec J; Ordak M; Oskroba A; Dworakowska A; Bujalska-Zadrozny M 39035269,Evaluating GPT-4V's performance in the Japanese national dental examination: A challenge explored.,2024,Journal of dental sciences,,,,"BACKGROUND/PURPOSE: Rapid advancements in AI technology have led to significant interest in its application across various fields, including medicine and dentistry. This study aimed to assess the capabilities of ChatGPT-4V with image recognition in answering image-based questions from the Japanese National Dental Examination (JNDE) to explore its potential as an educational support tool for dental students. MATERIALS AND METHODS: The dataset used questions from the JNDE, which was conducted in January 2023, with a focus on image-related queries. ChatGPT-4V was utilized, and standardized prompts, question texts, and images were input. Data and statistical analyses were conducted using Qlik Sense(R) and GraphPad Prism. RESULTS: The overall correct response rate of ChatGPT-4V for image-based JNDE questions was 35.0 %. 
The correct response rates were 57.1 % for compulsory questions, 43.6 % for general questions, and 28.6 % for clinical practical questions. In specialties like Dental Anesthesiology and Endodontics, ChatGPT-4V achieved correct response rates above 70 %, while response rates for Orthodontics and Oral Surgery were lower. A higher number of images in questions was correlated with lower accuracy, suggesting an impact of the number of images on correct and incorrect responses. CONCLUSION: While innovative, ChatGPT-4V's image recognition feature exhibited limitations, especially in handling image-intensive and complex clinical practical questions, and is not yet fully suitable as an educational support tool for dental students at its current stage. Further technological refinement and re-evaluation with a broader dataset are recommended.",Morishita M; Fukuda H; Muraoka K; Nakamura T; Hayashi M; Yoshioka I; Ono K; Awano S 38771247,The Use of Generative AI for Scientific Literature Searches for Systematic Reviews: ChatGPT and Microsoft Bing AI Performance Evaluation.,2024,JMIR medical informatics,,,,"BACKGROUND: A large language model is a type of artificial intelligence (AI) model that opens up great possibilities for health care practice, research, and education, although scholars have emphasized the need to proactively address the issue of unvalidated and inaccurate information regarding its use. One of the best-known large language models is ChatGPT (OpenAI). It is believed to be of great help to medical research, as it facilitates more efficient data set analysis, code generation, and literature review, allowing researchers to focus on experimental design as well as drug discovery and development. OBJECTIVE: This study aims to explore the potential of ChatGPT as a real-time literature search tool for systematic reviews and clinical decision support systems, to enhance their efficiency and accuracy in health care settings. 
METHODS: The search results of a published systematic review by human experts on the treatment of Peyronie disease were selected as a benchmark, and the literature search formula of the study was applied to ChatGPT and Microsoft Bing AI as a comparison to human researchers. Peyronie disease typically presents with discomfort, curvature, or deformity of the penis in association with palpable plaques and erectile dysfunction. To evaluate the quality of individual studies derived from AI answers, we created a structured rating system based on bibliographic information related to the publications. We classified its answers into 4 grades if the title existed: A, B, C, and F. No grade was given for a fake title or no answer. RESULTS: From ChatGPT, 7 (0.5%) out of 1287 identified studies were directly relevant, whereas Bing AI resulted in 19 (40%) relevant studies out of 48, compared to the human benchmark of 24 studies. In the qualitative evaluation, ChatGPT had 7 grade A, 18 grade B, 167 grade C, and 211 grade F studies, and Bing AI had 19 grade A and 28 grade C studies. CONCLUSIONS: This is the first study to compare AI and conventional human systematic review methods as a real-time literature collection tool for evidence-based medicine. The results suggest that the use of ChatGPT as a tool for real-time evidence generation is not yet accurate and feasible. Therefore, researchers should be cautious about using such AI. The limitations of this study using the generative pre-trained transformer model are that the search for research topics was not diverse and that it did not prevent the hallucination of generative AI. However, this study will serve as a standard for future studies by providing an index to verify the reliability and consistency of generative AI from a user's point of view. 
If the reliability and consistency of AI literature search services are verified, then the use of these technologies will help medical research greatly.",Gwon YN; Kim JH; Chung HS; Jung EJ; Chun J; Lee S; Shim SR 38898239,Can AI Answer My Questions? Utilizing Artificial Intelligence in the Perioperative Assessment for Abdominoplasty Patients.,2024,Aesthetic plastic surgery,,,,"BACKGROUND: Abdominoplasty is a common operation, used for a range of cosmetic and functional issues, often in the context of divarication of recti, significant weight loss, and after pregnancy. Despite this, patient-surgeon communication gaps can hinder informed decision-making. The integration of large language models (LLMs) in healthcare offers potential for enhancing patient information. This study evaluated the feasibility of using LLMs for answering perioperative queries. METHODS: This study assessed the efficacy of four leading LLMs-OpenAI's ChatGPT-3.5, Anthropic's Claude, Google's Gemini, and Bing's CoPilot-using fifteen unique prompts. All outputs were evaluated using the Flesch-Kincaid, Flesch Reading Ease score, and Coleman-Liau index for readability assessment. The DISCERN score and a Likert scale were utilized to evaluate quality. Scores were assigned by two plastic surgical residents and then reviewed and discussed until a consensus was reached by five plastic surgeon specialists. RESULTS: ChatGPT-3.5 required the highest level for comprehension, followed by Gemini, Claude, then CoPilot. Claude provided the most appropriate and actionable advice. In terms of patient-friendliness, CoPilot outperformed the rest, enhancing engagement and information comprehensiveness. ChatGPT-3.5 and Gemini offered adequate, though unremarkable, advice, employing more professional language. CoPilot uniquely included visual aids and was the only model to use hyperlinks, although they were not very helpful and acceptable, and it faced limitations in responding to certain queries. 
CONCLUSION: ChatGPT-3.5, Gemini, Claude, and Bing's CoPilot showcased differences in readability and reliability. LLMs offer unique advantages for patient care but require careful selection. Future research should integrate LLM strengths and address weaknesses for optimal patient education. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Lim B; Seth I; Cuomo R; Kenney PS; Ross RJ; Sofiadellis F; Pentangelo P; Ceccaroni A; Alfano C; Rozen WM 37509379,Enhancing Triage Efficiency and Accuracy in Emergency Rooms for Patients with Metastatic Prostate Cancer: A Retrospective Analysis of Artificial Intelligence-Assisted Triage Using ChatGPT 4.0.,2023,Cancers,,,,"BACKGROUND: Accurate and efficient triage is crucial for prioritizing care and managing resources in emergency rooms. This study investigates the effectiveness of ChatGPT, an advanced artificial intelligence system, in assisting health providers with decision-making for patients presenting with metastatic prostate cancer, focusing on the potential to improve both patient outcomes and resource allocation. METHODS: Clinical data from patients with metastatic prostate cancer who presented to the emergency room between 1 May 2022 and 30 April 2023 were retrospectively collected. The primary outcome was the sensitivity and specificity of ChatGPT in determining whether a patient required admission or discharge. The secondary outcomes included the agreement between ChatGPT and emergency medicine physicians, the comprehensiveness of diagnoses, the accuracy of treatment plans proposed by both parties, and the length of medical decision making. RESULTS: Of the 147 patients screened, 56 met the inclusion criteria. 
ChatGPT had a sensitivity of 95.7% in determining admission and a specificity of 18.2% in discharging patients. In 87.5% of cases, ChatGPT made the same primary diagnoses as physicians, with more accurate terminology use (42.9% vs. 21.4%, p = 0.02) and more comprehensive diagnostic lists (median number of diagnoses: 3 vs. 2, p < 0.001). Emergency Severity Index scores calculated by ChatGPT were not associated with admission (p = 0.12), hospital stay length (p = 0.91) or ICU admission (p = 0.54). Despite shorter mean word count (169 +/- 66 vs. 272 +/- 105, p < 0.001), ChatGPT was more likely to give additional treatment recommendations than physicians (94.3% vs. 73.5%, p < 0.001). CONCLUSIONS: Our hypothesis-generating data demonstrated that ChatGPT is associated with a high sensitivity in determining the admission of patients with metastatic prostate cancer in the emergency room. It also provides accurate and comprehensive diagnoses. These findings suggest that ChatGPT has the potential to assist health providers in improving patient triage in emergency settings, and may enhance both efficiency and quality of care provided by the physicians.",Gebrael G; Sahu KK; Chigarira B; Tripathi N; Mathew Thomas V; Sayegh N; Maughan BL; Agarwal N; Swami U; Li H 40206348,Artificial intelligence-large language models (AI-LLMs) for reliable and accurate cardiotocography (CTG) interpretation in obstetric practice.,2025,Computational and structural biotechnology journal,,,,"BACKGROUND: Accurate cardiotocography (CTG) interpretation is vital for the monitoring of fetal well-being during pregnancy and labor. Advanced artificial intelligence (AI) tools such as AI-large language models (AI-LLMs) may enhance the accuracy of CTG interpretation, but their potential has not been extensively evaluated. 
OBJECTIVE: This study aimed to assess the performance of three AI-LLMs (ChatGPT-4o, Gemini Advanced, and Copilot) in CTG image interpretation, compare their results to those of junior (JHDs) and senior human doctors (SHDs), and evaluate their reliability in clinical decision-making. STUDY DESIGN: Seven CTG images were interpreted by the three AI-LLMs, five SHDs, and five JHDs, with the evaluations scored by five blinded maternal-fetal medicine experts using a Likert scale for five parameters (relevance, clarity, depth, focus, and coherence). The homogeneity of the expert ratings and group performances were statistically compared. RESULTS: ChatGPT-4o scored 77.86, outperforming the Gemini Advanced (57.14), Copilot (47.29), and JHDs (61.57). Its performance closely approached that of the SHDs (80.43), with no statistically significant difference between the two (p > 0.05). ChatGPT-4o excelled in the depth parameter and was only marginally inferior to the SHDs regarding the other parameters. CONCLUSION: ChatGPT-4o demonstrated superior performance among the AI-LLMs, surpassed JHDs in CTG interpretation, and closely matched the performance level of SHDs. AI-LLMs, particularly ChatGPT-4o, are promising tools for assisting obstetricians, improving diagnostic accuracy, and enhancing obstetric patient care.",Gumilar KE; Wardhana MP; Akbar MIA; Putra AS; Banjarnahor DPP; Mulyana RS; Fatati I; Yu ZY; Hsu YC; Dachlan EG; Lu CH; Liao LN; Tan M 39848078,Large language models vs human for classifying clinical documents.,2025,International journal of medical informatics,,,,"BACKGROUND: Accurate classification of medical records is crucial for clinical documentation, particularly when using the 10th revision of the International Classification of Diseases (ICD-10) coding system. The use of machine learning algorithms and Systematized Nomenclature of Medicine (SNOMED) mapping has shown promise in performing these classifications. 
However, challenges remain, particularly in reducing false negatives, where certain diagnoses are not correctly identified by either approach. OBJECTIVE: This study explores the potential of leveraging advanced large language models to improve the accuracy of ICD-10 classifications in challenging cases of medical records where machine learning and SNOMED mapping fail. METHODS: We evaluated the performance of ChatGPT 3.5 and ChatGPT 4 in classifying ICD-10 codes from discharge summaries within selected records of the Medical Information Mart for Intensive Care (MIMIC) IV dataset. These records comprised 802 discharge summaries identified as false negatives by both machine learning and SNOMED mapping methods, reflecting their challenging nature. Each summary was assessed by ChatGPT 3.5 and 4 using a classification prompt, and the results were compared to human coder evaluations. Five human coders, with a combined experience of over 30 years, independently classified a stratified sample of 100 summaries to validate ChatGPT's performance. RESULTS: ChatGPT 4 demonstrated significantly improved consistency over ChatGPT 3.5, with matching results between runs ranging from 86% to 89%, compared to 57% to 67% for ChatGPT 3.5. The classification accuracy of ChatGPT 4 was variable across different ICD-10 codes. Overall, human coders performed better than ChatGPT. However, ChatGPT matched the median performance of human coders, achieving an accuracy rate of 22%. CONCLUSION: This study underscores the potential of integrating advanced language models with clinical coding processes to improve documentation accuracy. ChatGPT 4 demonstrated improved consistency and comparable performance to median human coders, achieving 22% accuracy in challenging cases.
Combining ChatGPT with methods like SNOMED mapping could further enhance clinical coding accuracy, particularly for complex scenarios.",Mustafa A; Naseem U; Rahimi Azghadi M 38989848,Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts.,2024,JMIR medical education,,,,"BACKGROUND: Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI's GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to human experts remain sparse. OBJECTIVE: This study aims to compare the medical accuracy of GPT-4 with human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also sought to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses. METHODS: We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio. 
RESULTS: GPT-4 and human experts displayed comparable efficacy in medical accuracy (""GPT-4 is better"" at 132/251, 52.6% vs ""Human expert is better"" at 119/251, 47.4%). In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience. CONCLUSIONS: GPT-4 has shown promising potential in automated medical consultation, with comparable medical accuracy to human experts. However, challenges remain particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions.",Jo E; Song S; Kim JH; Lim S; Kim JH; Cha JJ; Kim YM; Joo HJ 40264428,Efficient spheroid morphology assessment with a ChatGPT data analyst: implications for cell therapy.,2025,BioTechniques,,,,"BACKGROUND: Adipose-derived stem cells (ADSCs) exhibit promising potential for the treatment of various diseases, including osteoarthritis. Spheroids derived from ADSCs are a viable treatment option with enhanced anti-inflammatory effects and tissue repair capabilities. OBJECTIVE: SphereRing((R)) is a rotating donut-shaped tube that efficiently produces large quantities of spheroids. 
However, accurately measuring spheroid size for spheroid quality assessment is challenging. This study aimed to develop an automated method for measuring spheroid size using deep learning through the ChatGPT Data Analyst for image recognition and processing. METHOD: The area, perimeter, and circularity of spheroids generated with the SphereRing system were analyzed using ChatGPT Data Analyst and ImageJ. Measurement accuracy was validated using Bland-Altman analysis and scatter plot correlation coefficients. RESULTS: ChatGPT Data Analyst was consistent with ImageJ for all parameters. Bland-Altman plots demonstrated strong agreement; most data points were within the 95% limits. CONCLUSION: The ChatGPT Data Analyst provides a reliable and efficient alternative for assessing spheroid quality. This method reduces human error and improves reproducibility to enhance spheroid quality control. Thus, this method has potential applications in regenerative medicine.",Sakamoto T; Koma H; Kuwano A; Horie T; Fuku A; Kitajima H; Nakamura Y; Tanida I; Nakade Y; Hirata H; Tachi Y; Sunami H; Sakamoto D; Yamada S; Yamamoto N; Shimizu Y; Ishigaki Y; Ichiseki T; Kaneuji A; Osawa S; Kawahara N 38420977,Advancing Medical Practice with Artificial Intelligence: ChatGPT in Healthcare.,2024,The Israel Medical Association journal : IMAJ,,,,"BACKGROUND: Advancements in artificial intelligence (AI) and natural language processing (NLP) have led to the development of language models such as ChatGPT. These models have the potential to transform healthcare and medical research. However, understanding their applications and limitations is essential. OBJECTIVES: To present a view of ChatGPT research and to critically assess ChatGPT's role in medical writing and clinical environments. METHODS: We performed a literature review via the PubMed search engine from 20 November 2022, to 23 April 2023. The search terms included ChatGPT, OpenAI, and large language models. 
We included studies that focused on ChatGPT, explored its use or implications in medicine, and were original research articles. The selected studies were analyzed considering study design, NLP tasks, main findings, and limitations. RESULTS: Our study included 27 articles that examined ChatGPT's performance in various tasks and medical fields. These studies covered knowledge assessment, writing, and analysis tasks. While ChatGPT was found to be useful in tasks such as generating research ideas, aiding clinical reasoning, and streamlining workflows, limitations were also identified. These limitations included inaccuracies, inconsistencies, fictitious information, and limited knowledge, highlighting the need for further improvements. CONCLUSIONS: The review underscores ChatGPT's potential in various medical applications. Yet, it also points to limitations that require careful human oversight and responsible use to improve patient care, education, and decision-making.",Tessler I; Wolfovitz A; Livneh N; Gecel NA; Sorin V; Barash Y; Konen E; Klang E 38391252,Is ChatGPT knowledgeable of acute coronary syndromes and pertinent European Society of Cardiology Guidelines?,2024,Minerva cardiology and angiology,,,,"BACKGROUND: Advancements in artificial intelligence are being seen in multiple fields, including medicine, and this trend is likely to continue going forward. To analyze the accuracy and reproducibility of ChatGPT answers about acute coronary syndromes (ACS). METHODS: The questions asked to ChatGPT were prepared in two categories. A list of frequently asked questions (FAQs) created from inquiries asked by the public and while preparing the scientific question list, 2023 European Society of Cardiology (ESC) Guidelines for the management of ACS and ESC Clinical Practice Guidelines were used. Accuracy and reproducibility of ChatGPT responses about ACS were evaluated by two cardiologists with ten years of experience using Global Quality Score (GQS). 
RESULTS: Eventually, 72 FAQs related to ACS met the study inclusion criteria. In total, 65 (90.3%) ChatGPT answers scored GQS 5, which indicated highest accuracy and proficiency. None of the ChatGPT responses to FAQs about ACS scored GQS 1. In addition, highest accuracy and reliability of ChatGPT answers was obtained for the prevention and lifestyle section with GQS 5 for 19 (95%) answers, and GQS 4 for 1 (5%) answer. In contrast, accuracy and proficiency of ChatGPT answers were lowest for the treatment and management section. Moreover, 68 (88.3%) ChatGPT responses for guideline based questions scored GQS 5. Reproducibility of ChatGPT answers was 94.4% for FAQs and 90.9% for ESC guidelines questions. CONCLUSIONS: This study shows for the first time that ChatGPT can give accurate and sufficient responses to more than 90% of FAQs about ACS. In addition, proficiency and correctness of ChatGPT answers about questions depending on ESC guidelines was also substantial.",Gurbuz DC; Varis E 38907651,Evaluation of a Large Language Model's Ability to Assist in an Orthopedic Hand Clinic.,2024,"Hand (New York, N.Y.)",,,,"BACKGROUND: Advancements in artificial intelligence technology, such as OpenAI's large language model, ChatGPT, could transform medicine through applications in a clinical setting. This study aimed to assess the utility of ChatGPT as a clinical assistant in an orthopedic hand clinic. METHODS: Nine clinical vignettes, describing various common and uncommon hand pathologies, were constructed and reviewed by 4 fellowship-trained orthopedic hand surgeons and an orthopedic resident. ChatGPT was given these vignettes and asked to generate a differential diagnosis, potential workup plan, and provide treatment options for its top differential. Responses were graded for accuracy and the overall utility scored on a 5-point Likert scale. RESULTS: The diagnostic accuracy of ChatGPT was 7 out of 9 cases, indicating an overall accuracy rate of 78%. 
ChatGPT was less reliable with more complex pathologies and failed to identify an intentionally incorrect presentation. ChatGPT received a score of 3.8 +/- 1.4 for correct diagnosis, 3.4 +/- 1.4 for helpfulness in guiding patient management, 4.1 +/- 1.0 for appropriate workup for the actual diagnosis, 4.3 +/- 0.8 for an appropriate recommended treatment plan for the diagnosis, and 4.4 +/- 0.8 for the helpfulness of treatment options in managing patients. CONCLUSION: ChatGPT was successful in diagnosing most of the conditions; however, the overall utility of its advice was variable. While it performed well in recommending treatments, it faced difficulties in providing appropriate diagnoses for uncommon pathologies. In addition, it failed to identify an obvious error in presenting pathology.",Kotzur T; Singh A; Parker J; Peterson B; Sager B; Rose R; Corley F; Brady C 39817149,ChatGPT is an Unreliable Source of Peer-Reviewed Information for Common Total Knee and Hip Arthroplasty Patient Questions.,2025,Advances in orthopedics,,,,"Background: Advances in artificial intelligence (AI), machine learning, and publicly accessible language model tools such as ChatGPT-3.5 continue to shape the landscape of modern medicine and patient education. ChatGPT's open access (OA), instant, human-sounding interface capable of carrying discussion on myriad topics makes it a potentially useful resource for patients seeking medical advice. As it pertains to orthopedic surgery, ChatGPT may become a source to answer common preoperative questions regarding total knee arthroplasty (TKA) and total hip arthroplasty (THA). Since ChatGPT can utilize the peer-reviewed literature to source its responses, this study seeks to characterize the validity of its responses to common TKA and THA questions and characterize the peer-reviewed literature that it uses to formulate its responses. 
Methods: Preoperative TKA and THA questions were formulated by fellowship-trained adult reconstruction surgeons based on common questions posed by patients in the clinical setting. Questions were inputted into ChatGPT with the initial request of using solely the peer-reviewed literature to generate its responses. The validity of each response was rated on a Likert scale by the fellowship-trained surgeons, and the sources utilized were characterized in terms of accuracy of comparison to existing publications, publication date, study design, level of evidence, journal of publication, journal impact factor based on the Clarivate Analytics factor tool, journal OA status, and whether the journal is based in the United States. Results: A total of 109 sources were cited by ChatGPT in its answers to 17 questions regarding TKA procedures and 16 THA procedures. Thirty-nine sources (36%) were deemed accurate or able to be directly traced to an existing publication. Of these, seven (18%) were identified as duplicates, yielding a total of 32 unique sources that were identified as accurate and further characterized. The most common characteristics of these sources included dates of publication between 2011 and 2015 (10), publication in The Journal of Bone and Joint Surgery (13), journal impact factors between 5.1 and 10.0 (17), internationally based journals (17), and journals that are not OA (28). The most common study designs were retrospective cohort studies and case series (seven each). The level of evidence was broadly distributed between Levels I, III, and IV (seven each). The averages for the Likert scales for medical accuracy and completeness were 4.4/6 and 1.92/3, respectively.
Conclusions: Investigation into ChatGPT's response quality and use of peer-reviewed sources when prompted with archetypal pre-TKA and pre-THA questions found ChatGPT to provide mostly reliable responses based on fellowship-trained orthopedic surgeon review of 4.4/6 for accuracy and 1.92/3 for completeness despite a 64.22% rate of citing inaccurate references. This study suggests that until ChatGPT is proven to be a reliable source of valid information and references, patients must exercise extreme caution in directing their pre-TKA and THA questions to this medium.",Schwartzman JD; Shaath MK; Kerr MS; Green CC; Haidukewych GJ 37750052,"Evaluating the Sensitivity, Specificity, and Accuracy of ChatGPT-3.5, ChatGPT-4, Bing AI, and Bard Against Conventional Drug-Drug Interactions Clinical Tools.",2023,"Drug, healthcare and patient safety",,,,"BACKGROUND: AI platforms are equipped with advanced algorithms that have the potential to offer a wide range of applications in healthcare services. However, information about the accuracy of AI chatbots against conventional drug-drug interaction tools is limited. This study aimed to assess the sensitivity, specificity, and accuracy of ChatGPT-3.5, ChatGPT-4, Bing AI, and Bard in predicting drug-drug interactions. METHODS: AI-based chatbots (ie, ChatGPT-3.5, ChatGPT-4, Microsoft Bing AI, and Google Bard) were compared for their abilities to detect clinically relevant DDIs for 255 drug pairs. Descriptive statistics, such as specificity, sensitivity, accuracy, negative predictive value (NPV), and positive predictive value (PPV), were calculated for each tool. RESULTS: When a subscription tool was used as a reference, the specificity ranged from a low of 0.372 (ChatGPT-3.5) to a high of 0.769 (Microsoft Bing AI). Also, Microsoft Bing AI had the highest performance with an accuracy score of 0.788, with ChatGPT-3.5 having the lowest accuracy rate of 0.469.
There was an overall improvement in performance for all the programs when the reference tool switched to a free DDI source, but still, ChatGPT-3.5 had the lowest specificity (0.392) and accuracy (0.525), and Microsoft Bing AI demonstrated the highest specificity (0.892) and accuracy (0.890). When assessing the consistency of accuracy across two different drug classes, ChatGPT-3.5 and ChatGPT-4 showed the highest variability in accuracy. In addition, ChatGPT-3.5, ChatGPT-4, and Bard exhibited the highest fluctuations in specificity when analyzing two medications belonging to the same drug class. CONCLUSION: Bing AI had the highest accuracy and specificity, outperforming Google's Bard, ChatGPT-3.5, and ChatGPT-4. The findings highlight the significant potential these AI tools hold in transforming patient care. While the current AI platforms evaluated are not without limitations, their ability to quickly analyze potentially significant interactions with good sensitivity suggests a promising step towards improved patient safety.",Al-Ashwal FY; Zawiah M; Gharaibeh L; Abu-Farha R; Bitar AN 37549499,"Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI.",2023,International journal of medical informatics,,,,"BACKGROUND: Although chat generative pre-trained transformer (ChatGPT) has made several successful attempts in the medical field, most notably in answering medical questions in English, no studies have evaluated ChatGPT's performance in a Chinese context for a medical task. OBJECTIVE: The aim of this study was to evaluate ChatGPT's ability to understand medical knowledge in Chinese, as well as its potential to serve as an electronic health infrastructure for medical development, by evaluating its performance in medical examinations, records, and education.
METHOD: The Chinese (CNMLE) and English (ENMLE) datasets of the China National Medical Licensing Examination and the Chinese dataset (NEEPM) of the China National Entrance Examination for Postgraduate Clinical Medicine Comprehensive Ability were used to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4). We assessed answer accuracy, verbal fluency, and the classification of incorrect responses owing to hallucinations on multiple occasions. In addition, we tested ChatGPT's performance on discharge summaries and group learning in a Chinese context on a small scale. RESULTS: The accuracy of GPT-3.5 in CNMLE, ENMLE, and NEEPM was 56% (56/100), 76% (76/100), and 62% (62/100), respectively, compared to that of GPT-4, which was 84% (84/100), 86% (86/100), and 82% (82/100). The verbal fluency of all the ChatGPT responses exceeded 95%. Among the GPT-3.5 incorrect responses, the proportions of open-domain hallucinations were 66% (29/44), 54% (14/24), and 63% (24/38), whereas close-domain hallucinations accounted for 34% (15/44), 46% (14/24), and 37% (14/38), respectively. By contrast, GPT-4 open-domain hallucinations accounted for 56% (9/16), 43% (6/14), and 83% (15/18), while close-domain hallucinations accounted for 44% (7/16), 57% (8/14), and 17% (3/18), respectively. In the discharge summary, ChatGPT demonstrated logical coherence; however, GPT-3.5 could not fulfill the quality requirements, while GPT-4 met the qualification of 60% (6/10). In group learning, the verbal fluency and interaction satisfaction with ChatGPT were 100% (10/10). CONCLUSION: ChatGPT based on GPT-4 is on par with Chinese medical practitioners who passed the CNMLE and at the standard required for admission to clinical medical graduate programs in China. GPT-4 shows promising potential for discharge summarization and group learning. Additionally, it shows high verbal fluency, resulting in a positive human-computer interaction experience.
GPT-4 significantly improves multiple capabilities and reduces hallucinations compared to the previous GPT-3.5 model, with a particular leap forward in the Chinese comprehension capability of medical tasks. Artificial intelligence (AI) systems face the challenges of hallucinations, legal risks, and ethical issues. However, we discovered ChatGPT's potential to promote medical development as an electronic health infrastructure, paving the way for Medical AI to become necessary.",Wang H; Wu W; Dou Z; He L; Yang L 39137416,Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study.,2024,JMIR AI,,,,"BACKGROUND: Although uncertainties exist regarding implementation, artificial intelligence-driven generative language models (GLMs) have enormous potential in medicine. Deployment of GLMs could improve patient comprehension of clinical texts and improve low health literacy. OBJECTIVE: The goal of this study is to evaluate the potential of ChatGPT-3.5 and GPT-4 to tailor the complexity of medical information to patient-specific input education level, which is crucial if it is to serve as a tool in addressing low health literacy. METHODS: Input templates related to 2 prevalent chronic diseases-type II diabetes and hypertension-were designed. Each clinical vignette was adjusted for hypothetical patient education levels to evaluate output personalization. To assess the success of a GLM (GPT-3.5 and GPT-4) in tailoring output writing, the readability of pre- and posttransformation outputs were quantified using the Flesch reading ease score (FKRE) and the Flesch-Kincaid grade level (FKGL). RESULTS: Responses (n=80) were generated using GPT-3.5 and GPT-4 across 2 clinical vignettes. 
For GPT-3.5, FKRE means were 57.75 (SD 4.75), 51.28 (SD 5.14), 32.28 (SD 4.52), and 28.31 (SD 5.22) for 6th grade, 8th grade, high school, and bachelor's, respectively; FKGL mean scores were 9.08 (SD 0.90), 10.27 (SD 1.06), 13.4 (SD 0.80), and 13.74 (SD 1.18). GPT-3.5 only aligned with the prespecified education levels at the bachelor's degree. Conversely, GPT-4's FKRE mean scores were 74.54 (SD 2.6), 71.25 (SD 4.96), 47.61 (SD 6.13), and 13.71 (SD 5.77), with FKGL mean scores of 6.3 (SD 0.73), 6.7 (SD 1.11), 11.09 (SD 1.26), and 17.03 (SD 1.11) for the same respective education levels. GPT-4 met the target readability for all groups except the 6th-grade FKRE average. Both GLMs produced outputs with statistically significant differences (FKRE: 6th grade P<.001; 8th grade P<.001; high school P<.001; bachelor's P=.003; FKGL: 6th grade P=.001; 8th grade P<.001; high school P<.001; bachelor's P<.001) between mean FKRE and FKGL across input education levels. CONCLUSIONS: GLMs can change the structure and readability of medical text outputs according to input-specified education. However, GLMs categorize input education designation into 3 broad tiers of output readability: easy (6th and 8th grade), medium (high school), and difficult (bachelor's degree). This is the first result to suggest that there are broader boundaries in the success of GLMs in output text simplification. Future research must establish how GLMs can reliably personalize medical texts to prespecified education levels to enable a broader impact on health care literacy.",Spina A; Andalib S; Flores D; Vermani R; Halaseh FF; Nelson AM 38874270,"ChatGPT in Urology: Bridging Knowledge and Practice for Tomorrow's Healthcare, a Comprehensive Review.",2024,Journal of endourology,,,,"Background: Among emerging AI technologies, Chat-Generative Pre-Trained Transformer (ChatGPT) emerges as a notable language model, uniquely developed through artificial intelligence research. 
Its proven versatility across various domains, from language translation to healthcare data processing, underscores its promise within medical documentation, diagnostics, research, and education. The current comprehensive review aimed to investigate the utility of ChatGPT in urology education and practice and to highlight its potential limitations. Methods: The authors conducted a comprehensive literature review of the use of ChatGPT and its applications in urology education, research, and practice. Through a systematic review of the literature, with a search strategy using databases such as PubMed and Embase, we analyzed the advantages and limitations of using ChatGPT in urology and evaluated its potential impact. Results: A total of 78 records were eligible for inclusion. The benefits of ChatGPT were frequently cited across various contexts. Educational/academic benefits were mentioned in 21 records (87.5%), with ChatGPT showing the ability to assist urologists by offering precise information and responding to inquiries derived from patient data analysis, thereby supporting decision making; in 18 records (75%), advantages comprised personalized medicine, predictive capabilities for disease risks and outcomes, streamlined clinical workflows, and improved diagnostics. Nevertheless, apprehensions were expressed regarding potential misinformation, underscoring the necessity for human supervision to guarantee patient safety and address ethical concerns. Conclusion: The potential applications of ChatGPT hold the capacity to bring about transformative changes in urology education, research, and practice. 
AI technology can serve as a useful tool to augment human intelligence; however, it is essential to use it in a responsible and ethical manner.",Solano C; Tarazona N; Angarita GP; Medina AA; Ruiz S; Pedroza VM; Traxer O 39373234,Vaccination hesitancy: agreement between WHO and ChatGPT-4.0 or Gemini Advanced.,2025,Annali di igiene : medicina preventiva e di comunita,,,,"BACKGROUND: An increasing number of individuals use online Artificial Intelligence (AI) - based chatbots to retrieve information on health-related topics. This study aims to evaluate the accuracy in answering vaccine-related questions of the currently most commonly used, advanced chatbots - ChatGPT-4.0 and Google Gemini Advanced. METHODS: We compared the answers provided by the World Health Organization (WHO) to 38 open questions on vaccination myths and misconceptions with the answers created by ChatGPT-4.0 and Gemini Advanced. Responses were considered ""appropriate"" if the information provided was coherent and not in contrast to current WHO recommendations or to drug regulatory indications. RESULTS AND CONCLUSIONS: The rate of agreement between WHO answers and ChatGPT-4.0 or Gemini Advanced was very high, as both provided 36 (94.7%) appropriate responses. The few discrepancies between WHO and AI-chatbot answers could not be considered ""harmful"", and both chatbots often invited the user to check reliable sources, such as the CDC or WHO websites, or to contact a local healthcare professional. In their current versions, both AI-chatbots may already be powerful instruments to support the traditional communication tools in primary prevention, with the potential to improve health literacy and medication adherence and to reduce vaccine hesitancy and concerns. 
Given the rapid evolution of AI-based systems, further studies are strongly needed to monitor their accuracy and reliability over time.",Fiore M; Bianconi A; Acuti Martellucci C; Rosso A; Zauli E; Flacco ME; Manzoli L 38065864,Evaluating cluster analysis techniques in ChatGPT versus R-language with visualizations of author collaborations and keyword cooccurrences on articles in the Journal of Medicine (Baltimore) 2023: Bibliometric analysis.,2023,Medicine,,,,"BACKGROUND: Analyses of author collaborations and keyword co-occurrences are frequently used in bibliographic research. However, no studies have introduced a straightforward yet effective approach, such as utilizing ChatGPT with Code Interpreter (ChatGPT_CI) or the R language, for creating cluster-oriented networks. This research aims to compare cluster analysis methods in ChatGPT_CI and R, visualize country-specific author collaborations, and then demonstrate the most effective approach. METHODS: The research focused on articles and review pieces from Medicine (Baltimore) published in 2023. By August 20, 2023, we had gathered metadata for 1976 articles using the Web of Science core collections. The efficiency and effectiveness of cluster displays between ChatGPT_CI and R were compared by evaluating their time consumption. The best method was then employed to present a series of visualizations of country-specific author collaborations, rooted in social network and cluster analyses. Visualization techniques incorporating network charts, chord diagrams, circle bar plots, circle packing plots, heat dendrograms, dendrograms, and word clouds were demonstrated. We further highlighted the research profiles of 2 prolific authors using timeline visuals. 
RESULTS: The research findings include that (1) the most active contributors were China, Nanjing Medical University (China), the Medical School Department, and Dr Chou from Taiwan when considering countries, institutions, departments, and individual authors, respectively; (2) the highest cited articles originated from Medicine (Baltimore), accounting for 4.53%, followed by the New England Journal of Medicine, PLOS ONE, LANCET, and The Journal of the American Medical Association, with respective contributions of 3.25%, 2.7%, 2.52%, and 1.54%; (3) visual cluster analysis in R proved to be more efficient and effective than ChatGPT_CI, reducing the time taken from 1 hour to just 3 minutes; (4) 7 cluster-focused networks were crafted using R on a custom platform; and (5) the research trajectories of 2 prominent authors (Dr Brin from the United States and Dr Chou from Taiwan) and article themes in Medicine 2023 were depicted using timeline visuals. CONCLUSIONS: This research highlighted the efficient and effective methods for conducting cluster analyses of author collaborations using R. For future related studies, such as keyword co-occurrence analysis, R is recommended as a viable alternative for bibliographic research.",Cheng YZ; Lai TH; Chien TW; Chou W 38182023,Do ChatGPT and Google differ in answers to commonly asked patient questions regarding total shoulder and total elbow arthroplasty?,2024,Journal of shoulder and elbow surgery,,,,"BACKGROUND: Artificial intelligence (AI) and large language models (LLMs) offer a new potential resource for patient education. The answers by Chat Generative Pre-Trained Transformer (ChatGPT), an LLM AI text bot, to frequently asked questions (FAQs) were compared to answers provided by a contemporary Google search to determine the reliability of information provided by these sources for patient education in upper extremity arthroplasty. 
METHODS: ""Total shoulder arthroplasty"" (TSA) and ""total elbow arthroplasty"" (TEA) were entered into Google Search and ChatGPT 3.0 to determine the ten most common FAQs. On Google, the FAQs were obtained through the ""people also ask"" section, while ChatGPT was asked to provide the ten most common FAQs. Each question, answer, and reference(s) cited were recorded. A modified version of the Rothwell system was used to categorize questions into 10 subtopics: special activities, timeline of recovery, restrictions, technical details, cost, indications/management, risks and complications, pain, longevity, and evaluation of surgery. Each reference was categorized into the following groups: commercial, academic, medical practice, single surgeon personal, or social media. Questions for TSA and TEA were combined for analysis and compared between Google and ChatGPT with a 2 sample Z-test for proportions. RESULTS: Overall, most questions were related to procedural indications or management (17.5%). There were no significant differences between Google and ChatGPT across question categories. The majority of references were from academic websites (65%). ChatGPT produced a greater number of academic references compared to Google (80% vs. 50%; P = .047), while Google more commonly provided medical practice references (25% vs. 0%; P = .017). CONCLUSION: In conjunction with patient-physician discussions, AI LLMs may provide a reliable resource for patients. By providing information based on academic references, these tools have the potential to improve health literacy and shared decision making for patients searching for information about TSA and TEA. 
CLINICAL SIGNIFICANCE: With the rising prevalence of AI programs, it is essential to understand how these applications affect patient education in medicine.",Tharakan S; Klein B; Bartlett L; Atlas A; Parada SA; Cohn RM 39592492,Performance of Artificial Intelligence Chatbots in Answering Clinical Questions on Japanese Practical Guidelines for Implant-based Breast Reconstruction.,2025,Aesthetic plastic surgery,,,,"BACKGROUND: Artificial intelligence (AI) chatbots, including ChatGPT-4 (GPT-4) and Grok-1 (Grok), have been shown to be potentially useful in several medical fields, but have not been examined in plastic and aesthetic surgery. The aim of this study is to evaluate the responses of these AI chatbots for clinical questions (CQs) related to the guidelines for implant-based breast reconstruction (IBBR) published by the Japan Society of Plastic and Reconstructive Surgery (JSPRS) in 2021. METHODS: CQs in the JSPRS guidelines were used as question sources. Responses from two AI chatbots, GPT-4 and Grok, were evaluated for accuracy, informativeness, and readability by five Japanese Board-certified breast reconstruction specialists and five Japanese clinical fellows of plastic surgery. RESULTS: GPT-4 outperformed Grok significantly in terms of accuracy (p < 0.001), informativeness (p < 0.001), and readability (p < 0.001) when evaluated by plastic surgery fellows. Compared to the original guidelines, Grok scored significantly lower in all three areas (all p < 0.001). The accuracy of GPT-4 was rated to be significantly higher based on scores given by plastic surgery fellows compared to those of breast reconstruction specialists (p = 0.012), whereas there was no significant difference between these scores for Grok. CONCLUSIONS: The study suggests that GPT-4 has the potential to assist in interpreting and applying clinical guidelines for IBBR but importantly there is still a risk that AI chatbots can misinform. 
Further studies are needed to understand the broader role of current and future AI chatbots in breast reconstruction surgery. LEVEL OF EVIDENCE IV: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine Ratings, please refer to Table of Contents or online Instructions to Authors www.springer.com/00266.",Shiraishi M; Sowa Y; Tomita K; Terao Y; Satake T; Muto M; Morita Y; Higai S; Toyohara Y; Kurokawa Y; Sunaga A; Okazaki M 38693697,Exploring the Performance of ChatGPT-4 in the Taiwan Audiologist Qualification Examination: Preliminary Observational Study Highlighting the Potential of AI Chatbots in Hearing Care.,2024,JMIR medical education,,,,"BACKGROUND: Artificial intelligence (AI) chatbots, such as ChatGPT-4, have shown immense potential for application across various aspects of medicine, including medical education, clinical practice, and research. OBJECTIVE: This study aimed to evaluate the performance of ChatGPT-4 in the 2023 Taiwan Audiologist Qualification Examination, thereby preliminarily exploring the potential utility of AI chatbots in the fields of audiology and hearing care services. METHODS: ChatGPT-4 was tasked to provide answers and reasoning for the 2023 Taiwan Audiologist Qualification Examination. The examination encompassed six subjects: (1) basic auditory science, (2) behavioral audiology, (3) electrophysiological audiology, (4) principles and practice of hearing devices, (5) health and rehabilitation of the auditory and balance systems, and (6) auditory and speech communication disorders (including professional ethics). Each subject included 50 multiple-choice questions, with the exception of behavioral audiology, which had 49 questions, amounting to a total of 299 questions. 
RESULTS: The correct answer rates across the 6 subjects were as follows: 88% for basic auditory science, 63% for behavioral audiology, 58% for electrophysiological audiology, 72% for principles and practice of hearing devices, 80% for health and rehabilitation of the auditory and balance systems, and 86% for auditory and speech communication disorders (including professional ethics). The overall accuracy rate for the 299 questions was 75%, which surpasses the examination's passing criteria of an average 60% accuracy rate across all subjects. A comprehensive review of ChatGPT-4's responses indicated that incorrect answers were predominantly due to information errors. CONCLUSIONS: ChatGPT-4 demonstrated a robust performance in the Taiwan Audiologist Qualification Examination, showcasing effective logical reasoning skills. Our results suggest that with enhanced information accuracy, ChatGPT-4's performance could be further improved. This study indicates significant potential for the application of AI chatbots in audiology and hearing care services.",Wang S; Mo C; Chen Y; Dai X; Wang H; Shen X 38184368,Assessing the clinical reasoning of ChatGPT for mechanical thrombectomy in patients with stroke.,2024,Journal of neurointerventional surgery,,,,"BACKGROUND: Artificial intelligence (AI) has become a promising tool in medicine. ChatGPT, a large language model AI Chatbot, shows promise in supporting clinical practice. We assess the potential of ChatGPT as a clinical reasoning tool for mechanical thrombectomy in patients with stroke. METHODS: An internal validation of the abilities of ChatGPT was first performed using artificially created patient scenarios before assessment of real patient scenarios from the medical center's stroke database. All patients with large vessel occlusions who underwent mechanical thrombectomy at Tulane Medical Center between January 1, 2022 and December 31, 2022 were included in the study. 
The performance of ChatGPT in evaluating which patients should undergo mechanical thrombectomy was compared with the decisions made by board-certified stroke neurologists and neurointerventionalists. The interpretation skills, clinical reasoning, and accuracy of ChatGPT were analyzed. RESULTS: 102 patients with large vessel occlusions underwent mechanical thrombectomy. ChatGPT agreed with the physician's decision whether or not to pursue thrombectomy in 54.3% of the cases. ChatGPT had mistakes in 8.8% of the cases, consisting of mathematics, logic, and misinterpretation errors. In the internal validation phase, ChatGPT was able to provide nuanced clinical reasoning and perform multi-step thinking, although with an increased rate of making mistakes. CONCLUSION: ChatGPT shows promise in clinical reasoning, including the ability to factor in a patient's underlying comorbidities when considering mechanical thrombectomy. However, ChatGPT is prone to errors as well and should not be relied on as a sole decision-making tool in its present form, but it has potential to assist clinicians with a more efficient workflow.",Chen TC; Couldwell MW; Singer J; Singer A; Koduri L; Kaminski E; Nguyen K; Multala E; Dumont AS; Wang A 40191936,The role of artificial intelligence in cardiovascular research: Fear less and live bolder.,2025,European journal of clinical investigation,,,,"BACKGROUND: Artificial intelligence (AI) has captured the attention of everyone, including cardiovascular (CV) clinicians and scientists. Moving beyond philosophical debates, modern cardiology cannot overlook AI's growing influence but must actively explore its potential applications in clinical practice and research methodology. METHODS AND RESULTS: AI offers exciting possibilities for advancing CV medicine by uncovering disease heterogeneity, integrating complex multimodal data, and enhancing treatment strategies. 
In this review, we discuss the innovative applications of AI in cardiac electrophysiology, imaging, angiography, biomarkers, and genomic data, as well as emerging tools like face recognition and speech analysis. Furthermore, we focus on the expanding role of machine learning (ML) in predicting CV risk and outcomes, outlining a roadmap for the implementation of AI in CV care delivery. While the future of AI holds great promise, technical limitations and ethical challenges remain significant barriers to its widespread clinical adoption. CONCLUSIONS: Addressing these issues through the development of high-quality standards and involving key stakeholders will be essential for AI to transform cardiovascular care safely and effectively.",Scuricini A; Ramoni D; Liberale L; Montecucco F; Carbone F 38528129,Exploring the Unknown: Evaluating ChatGPT's Performance in Uncovering Novel Aspects of Plastic Surgery and Identifying Areas for Future Innovation.,2024,Aesthetic plastic surgery,,,,"BACKGROUND: Artificial intelligence (AI) has emerged as a powerful tool in various medical fields, including plastic surgery. This study aims to evaluate the performance of ChatGPT, an AI language model, in elucidating historical aspects of plastic surgery and identifying potential avenues for innovation. METHODS: A comprehensive analysis of ChatGPT's responses to a diverse range of plastic surgery-related inquiries was performed. The quality of the AI-generated responses was assessed based on their relevance, accuracy, and novelty. Additionally, the study examined the AI's ability to recognize gaps in existing knowledge and propose innovative solutions. ChatGPT's responses were analysed by specialist plastic surgeons with extensive research experience, and quantitatively analysed with a Likert scale. RESULTS: ChatGPT demonstrated a high degree of proficiency in addressing a wide array of plastic surgery-related topics. 
The AI-generated responses were found to be relevant and accurate in most cases. However, ChatGPT demonstrated convergent thinking and failed to generate genuinely novel ideas to revolutionize plastic surgery. Instead, it suggested currently popular trends that demonstrate great potential for further advancements. Some of the references presented were also erroneous, as they could not be validated against the existing literature. CONCLUSION: Although ChatGPT requires major improvements, this study highlights its potential as an effective tool for uncovering novel aspects of plastic surgery and identifying areas for future innovation. By leveraging the capabilities of AI language models, plastic surgeons may drive advancements in the field. Further studies are needed to cautiously explore the integration of AI-driven insights into clinical practice and to evaluate their impact on patient outcomes. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.",Lim B; Seth I; Xie Y; Kenney PS; Cuomo R; Rozen WM 37713254,The Potential of ChatGPT as a Self-Diagnostic Tool in Common Orthopedic Diseases: Exploratory Study.,2023,Journal of medical Internet research,,,,"BACKGROUND: Artificial intelligence (AI) has gained tremendous popularity recently, especially the use of natural language processing (NLP). ChatGPT is a state-of-the-art chatbot capable of creating natural conversations using NLP. The use of AI in medicine can have a tremendous impact on health care delivery. Although some studies have evaluated ChatGPT's accuracy in self-diagnosis, there is no research regarding its precision and the degree to which it recommends medical consultations. 
OBJECTIVE: The aim of this study was to evaluate ChatGPT's ability to accurately and precisely self-diagnose common orthopedic diseases, as well as the degree of recommendation it provides for medical consultations. METHODS: Over a 5-day course, each of the study authors submitted the same questions to ChatGPT. The conditions evaluated were carpal tunnel syndrome (CTS), cervical myelopathy (CM), lumbar spinal stenosis (LSS), knee osteoarthritis (KOA), and hip osteoarthritis (HOA). Answers were categorized as either correct, partially correct, incorrect, or a differential diagnosis. The percentage of correct answers and reproducibility were calculated. The reproducibility between days and between raters was calculated using the Fleiss kappa coefficient. Answers that recommended that the patient seek medical attention were recategorized according to the strength of the recommendation as defined by the study. RESULTS: The ratios of correct answers were 25/25, 1/25, 24/25, 16/25, and 17/25 for CTS, CM, LSS, KOA, and HOA, respectively. The ratios of incorrect answers were 23/25 for CM and 0/25 for all other conditions. The reproducibility between days was 1.0, 0.15, 0.7, 0.6, and 0.6 for CTS, CM, LSS, KOA, and HOA, respectively. The reproducibility between raters was 1.0, 0.1, 0.64, -0.12, and 0.04 for CTS, CM, LSS, KOA, and HOA, respectively. Among the answers recommending medical attention, the phrases ""essential,"" ""recommended,"" ""best,"" and ""important"" were used. Specifically, ""essential"" occurred in 4 out of 125, ""recommended"" in 12 out of 125, ""best"" in 6 out of 125, and ""important"" in 94 out of 125 answers. Additionally, 7 out of the 125 answers did not include a recommendation to seek medical attention. CONCLUSIONS: The accuracy and reproducibility of ChatGPT in self-diagnosing five common orthopedic conditions were inconsistent. The accuracy could potentially be improved by adding symptoms that could easily identify a specific location. 
Only a few answers were accompanied by a strong recommendation to seek medical attention according to our study standards. Although ChatGPT could serve as a potential first step in accessing care, we found variability in accurate self-diagnosis. Given the risk of harm with self-diagnosis without medical follow-up, it would be prudent for an NLP tool to include clear language alerting patients to seek expert medical opinions. We hope to shed further light on the use of AI in a future clinical study.",Kuroiwa T; Sarcon A; Ibara T; Yamada E; Yamamoto A; Tsukamoto K; Fujita K 37707884,The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation.,2023,JMIR medical education,,,,"BACKGROUND: Artificial intelligence (AI) has many applications in various aspects of our daily life, including health, criminal, education, civil, business, and liability law. One aspect of AI that has gained significant attention is natural language processing (NLP), which refers to the ability of computers to understand and generate human language. OBJECTIVE: This study aims to examine the potential for, and concerns of, using AI in scientific research. For this purpose, research articles were generated with ChatGPT, the quality of the generated reports was analyzed, and the application's impact on the research framework, data analysis, and the literature review was assessed. The study also explored concerns around ownership and the integrity of research when using AI-generated text. METHODS: A total of 4 articles were generated using ChatGPT, and thereafter evaluated by 23 reviewers. The researchers developed an evaluation form to assess the quality of the articles generated. Additionally, 50 abstracts were generated using ChatGPT and their quality was evaluated. The data were subjected to ANOVA and thematic analysis to analyze the qualitative data provided by the reviewers. 
RESULTS: When using detailed prompts and providing the context of the study, ChatGPT would generate high-quality research that could be published in high-impact journals. However, ChatGPT had a minor impact on developing the research framework and data analysis. The primary area needing improvement was the development of the literature review. Moreover, reviewers expressed concerns around ownership and the integrity of the research when using AI-generated text. Nonetheless, ChatGPT has a strong potential to increase human productivity in research and can be used in academic writing. CONCLUSIONS: AI-generated text has the potential to improve the quality of high-impact research articles. The findings of this study suggest that decision makers and researchers should focus more on the methodology part of the research, which includes research design, developing research tools, and analyzing data in depth, to draw strong theoretical and practical implications, thereby establishing a revolution in scientific research in the era of AI. The practical implications of this study can be used in different fields such as medical education to deliver materials to develop the basic competencies for both medicine students and faculty members.",Khlaif ZN; Mousa A; Hattab MK; Itmazi J; Hassan AA; Sanmugam M; Ayyoub A 39318412,Appraisal of ChatGPT's responses to common patient questions regarding Tommy John surgery.,2024,Shoulder & elbow,,,,"BACKGROUND: Artificial intelligence (AI) has progressed at a fast pace. ChatGPT, a rapidly expanding AI platform, has several growing applications in medicine and patient care. However, its ability to provide high-quality answers to patient questions about orthopedic procedures such as Tommy John surgery is unknown. Our objective is to evaluate the quality of information provided by ChatGPT 3.5 and 4.0 in response to patient questions regarding Tommy John surgery. 
METHODS: Twenty-five patient questions regarding Tommy John surgery were posed to ChatGPT 3.5 and 4.0. Readability was assessed via Flesch-Kincaid Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Score, Simple Measure of Gobbledygook, Coleman-Liau, and Automated Readability Index. The quality of each response was graded using a 5-point Likert scale. RESULTS: ChatGPT generated information at an educational level that greatly exceeds the recommended level. ChatGPT 4.0 produced slightly better responses to common questions regarding Tommy John surgery with fewer inaccuracies than ChatGPT 3.5. CONCLUSION: Although ChatGPT can provide accurate information regarding Tommy John surgery, its responses may not be easily comprehended by the average patient. As AI platforms become more accessible to the public, patients must be aware of their limitations.",Shaari AL; Fano AN; Anakwenze O; Klifto C 38133911,Medical Student Experiences and Perceptions of ChatGPT and Artificial Intelligence: Cross-Sectional Study.,2023,JMIR medical education,,,,"BACKGROUND: Artificial intelligence (AI) has the potential to revolutionize the way medicine is learned, taught, and practiced, and medical education must prepare learners for these inevitable changes. Academic medicine has, however, been slow to embrace recent AI advances. Since its launch in November 2022, ChatGPT has emerged as a fast and user-friendly large language model that can assist health care professionals, medical educators, students, trainees, and patients. While many studies focus on the technology's capabilities, potential, and risks, there is a gap in studying the perspective of end users. OBJECTIVE: The aim of this study was to gauge the experiences and perspectives of graduating medical students on ChatGPT and AI in their training and future careers. 
METHODS: A cross-sectional web-based survey of recently graduated medical students was conducted in an international academic medical center between May 5, 2023, and June 13, 2023. Descriptive statistics were used to tabulate variable frequencies. RESULTS: Of 325 applicants to the residency programs, 265 completed the survey (an 81.5% response rate). The vast majority of respondents denied using ChatGPT in medical school, with 20.4% (n=54) using it to help complete written assessments and only 9.4% using the technology in their clinical work (n=25). More students planned to use it during residency, primarily for exploring new medical topics and research (n=168, 63.4%) and exam preparation (n=151, 57%). Male students were significantly more likely to believe that AI will improve diagnostic accuracy (n=47, 51.7% vs n=69, 39.7%; P=.001), reduce medical error (n=53, 58.2% vs n=71, 40.8%; P=.002), and improve patient care (n=60, 65.9% vs n=95, 54.6%; P=.007). Previous experience with AI was significantly associated with positive AI perception in terms of improving patient care, decreasing medical errors and misdiagnoses, and increasing the accuracy of diagnoses (P=.001, P<.001, P=.008, respectively). CONCLUSIONS: The surveyed medical students had minimal formal and informal experience with AI tools and limited perceptions of the potential uses of AI in health care but had overall positive views of ChatGPT and AI and were optimistic about the future of AI in medical education and health care. 
Structured curricula and formal policies and guidelines are needed to adequately prepare medical learners for the forthcoming integration of AI in medicine.",Alkhaaldi SMI; Kassab CH; Dimassi Z; Oyoun Alsoud L; Al Fahim M; Al Hageh C; Ibrahim H 38708385,Evaluating Artificial Intelligence's Role in Teaching the Reporting and Interpretation of Computed Tomographic Angiography for Preoperative Planning of the Deep Inferior Epigastric Artery Perforator Flap.,2024,JPRAS open,,,,"BACKGROUND: Artificial intelligence (AI) has the potential to transform preoperative planning for breast reconstruction by enhancing the efficiency, accuracy, and reliability of radiology reporting through automatic interpretation and perforator identification. Large language models (LLMs) have recently advanced significantly in medicine. This study aimed to evaluate the proficiency of contemporary LLMs in interpreting computed tomography angiography (CTA) scans for deep inferior epigastric perforator (DIEP) flap preoperative planning. METHODS: Four prominent LLMs, ChatGPT-4, BARD, Perplexity, and BingAI, answered six questions on CTA scan reporting. A panel of expert plastic surgeons with extensive experience in breast reconstruction assessed the responses using a Likert scale. In contrast, the responses' readability was evaluated using the Flesch Reading Ease score, the Flesch-Kincaid Grade level, and the Coleman-Liau Index. The DISCERN score was utilized to determine the responses' suitability. Statistical significance was identified through a t-test, and P-values < 0.05 were considered significant. RESULTS: BingAI provided the most accurate and useful responses to prompts, followed by Perplexity, ChatGPT, and then BARD. BingAI had the greatest Flesch Reading Ease (34.7+/-5.5) and DISCERN (60.5+/-3.9) scores. Perplexity had higher Flesch-Kincaid Grade level (20.5+/-2.7) and Coleman-Liau Index (17.8+/-1.6) scores than other LLMs. 
CONCLUSION: LLMs exhibit limitations in their capabilities of reporting CTA for preoperative planning of breast reconstruction, yet the rapid advancements in technology hint at a promising future. AI stands poised to enhance the education of CTA reporting and aid preoperative planning. In the future, AI technology could provide automatic CTA interpretation, enhancing the efficiency, accuracy, and reliability of CTA reports.",Lim B; Cevik J; Seth I; Sofiadellis F; Ross RJ; Rozen WM; Cuomo R 37863479,The Role of Artificial Intelligence in Surgery: What do General Surgery Residents Think?,2024,The American surgeon,,,,"BACKGROUND: Artificial intelligence (AI) holds significant potential in medical education and patient care, but its rapid emergence presents ethical and practical challenges. This study explored the perspectives of surgical residents on AI's role in medicine. METHODS: We performed a cross-sectional study surveying general surgery residents at a university-affiliated teaching hospital about their views on AI in medicine and surgical training. The survey covered demographics, residents' understanding of AI, its integration into medical practice, and use of AI tools like ChatGPT. The survey design was inspired by a recent national survey and underwent pretesting before deployment. RESULTS: Of the 31 participants surveyed, 24% identified diagnostics as AI's top application, 12% favored its use in identifying anatomical structures in surgeries, and 20% endorsed AI integration into EMRs for predictive models. Attitudes toward AI varied based on its intended application: 77.41% expressed concern about AI making life decisions and 70.97% felt excited about its application for repetitive tasks. A significant 67.74% believed AI could enhance the understanding of medical knowledge. Perception of AI integration varied with AI familiarity (P = .01), with more knowledgeable respondents expressing more positivity. 
Moreover, familiarity influenced the perceived academic use of ChatGPT (P = .039) and attitudes toward AI in operating rooms (P = .032). Conclusion: This study provides insights into surgery residents' perceptions of AI in medical practice and training. These findings can inform future research, shape policy decisions, and guide AI development, promoting a harmonious collaboration between AI and surgeons to improve both training and patient care.",St John A; Cooper L; Kavic SM 38912098,AI-Generated Graduate Medical Education Content for Total Joint Arthroplasty: Comparing ChatGPT Against Orthopaedic Fellows.,2024,Arthroplasty today,,,,"BACKGROUND: Artificial intelligence (AI) in medicine has primarily focused on diagnosing and treating diseases and assisting in the development of academic scholarly work. This study aimed to evaluate a new use of AI in orthopaedics: content generation for professional medical education. Quality, accuracy, and time were compared between content created by ChatGPT and orthopaedic surgery clinical fellows. METHODS: ChatGPT and 3 orthopaedic adult reconstruction fellows were tasked with creating educational summaries of 5 total joint arthroplasty-related topics. Responses were evaluated across 5 domains by 4 blinded reviewers from different institutions who are all current or former total joint arthroplasty fellowship directors or national arthroplasty board review course directors. RESULTS: ChatGPT created better orthopaedic content than fellows when mean aggregate scores for all 5 topics and domains were compared (P /=4 (interquartile range [IQR]: 2) and a mean score of 3.87 [SD: +/-0.6]. For completeness, 34.2% of the replies had a median score of 3 and 55.3% had a median score of between 2 and <3. Overall, the mean rating was 2.24 [SD: +/-0.4, median: 2, IQR: 1]. 
Though groups 3 and 4 had a higher mean for both accuracy and completeness, there was no significant scoring variation between the four question groups [Kruskal-Wallis test p > 0.05]. However, statistical analysis for the different individual questions revealed a significant difference for both accuracy [p < 0.001] and completeness [p < 0.001]. The questions that rated highest for both accuracy and completeness were related to smoking, while the lowest ratings related to screening for malignancy and vaccinations, especially in the context of immunosuppression and family planning. CONCLUSION: This is the first study to demonstrate the capability of an AI-based system to provide accurate and comprehensive answers to real-world patient queries in IBD. AI systems may serve as a useful adjunct for patients, in addition to standard of care in clinics and validated patient information resources. However, responses in specialist areas may deviate from evidence-based guidance, and the replies need to give firmer advice.",Sciberras M; Farrugia Y; Gordon H; Furfaro F; Allocca M; Torres J; Arebi N; Fiorino G; Iacucci M; Verstockt B; Magro F; Katsanos K; Busuttil J; De Giovanni K; Fenech VA; Chetcuti Zammit S; Ellul P 38219629,Artificial intelligence and ChatGPT: An otolaryngology patient's ally or foe?,2024,American journal of otolaryngology,,,,"BACKGROUND: As artificial intelligence (AI) is being integrated into the healthcare sphere, there is a need to evaluate its effectiveness in the various subspecialties of medicine, including otolaryngology. Our study intends to provide a cursory review of ChatGPT's diagnostic capability, ability to convey pathophysiology in simple terms, accuracy in providing management recommendations, and appropriateness in follow-up and post-operative recommendations in common otolaryngologic conditions. 
METHODS: Adenotonsillectomy (T&A), tympanoplasty (TP), endoscopic sinus surgery (ESS), parotidectomy (PT), and total laryngectomy (TL) were substituted for the word procedure in the following five questions and input into ChatGPT version 3.5: ""How do I know if I need (procedure),"" ""What are treatment alternatives to (procedure),"" ""What are the risks of (procedure),"" ""How is a (procedure) performed,"" and ""What is the recovery process for (procedure)?"" Two independent study members analyzed the output and discrepancies were reviewed, discussed, and reconciled between study members. RESULTS: In terms of management recommendations, ChatGPT was able to give generalized statements of evaluation, need for intervention, and the basics of the procedure without major aberrant errors or risks of safety. ChatGPT was successful in providing appropriate treatment alternatives in all procedures tested. When queried for methodology, risks, and procedural steps, ChatGPT lacked precision in the description of procedural steps, missed key surgical details, and did not accurately provide all major risks of each procedure. In terms of the recovery process, ChatGPT showed promise in T&A, TP, ESS, and PT but struggled in the complexity of TL, stating the patient could speak immediately after surgery without speech therapy. CONCLUSIONS: ChatGPT accurately demonstrated the need for intervention, management recommendations, and treatment alternatives in common ENT procedures. However, ChatGPT was not able to replace an otolaryngologist's clinical reasoning necessary to discuss procedural methodology, risks, and the recovery process in complex procedures. 
As AI becomes further integrated into healthcare, there is a need to continue to explore its indications, evaluate its limits, and refine its use to the otolaryngologist's advantage.",Langlie J; Kamrava B; Pasick LJ; Mei C; Hoffer ME 37307503,ChatGPT in medical school: how successful is AI in progress testing?,2023,Medical education online,,,,"BACKGROUND: As generative artificial intelligence (AI), ChatGPT provides easy access to a wide range of information, including factual knowledge in the field of medicine. Given that knowledge acquisition is a basic determinant of physicians' performance, teaching and testing different levels of medical knowledge is a central task of medical schools. To measure the factual knowledge level of the ChatGPT responses, we compared the performance of ChatGPT with that of medical students in a progress test. METHODS: A total of 400 multiple-choice questions (MCQs) from the progress test in German-speaking countries were entered into ChatGPT's user interface to obtain the percentage of correctly answered questions. We calculated the correlations of the correctness of ChatGPT responses with behavior in terms of response time, word count, and difficulty of a progress test question. RESULTS: Of the 395 responses evaluated, 65.5% of the progress test questions answered by ChatGPT were correct. On average, ChatGPT required 22.8 s (SD 17.5) for a complete response, containing 36.2 (SD 28.1) words. There was no correlation between the time used and word count with the accuracy of the ChatGPT response (correlation coefficient for time rho = -0.08, 95% CI [-0.18, 0.02], t(393) = -1.55, p = 0.121; for word count rho = -0.03, 95% CI [-0.13, 0.07], t(393) = -0.54, p = 0.592). There was a significant correlation between the difficulty index of the MCQs and the accuracy of the ChatGPT response (correlation coefficient for difficulty: rho = 0.16, 95% CI [0.06, 0.25], t(393) = 3.19, p = 0.002). 
CONCLUSION: ChatGPT was able to correctly answer two-thirds of all MCQs at the German state licensing exam level in Progress Test Medicine and outperformed almost all medical students in years 1-3. The ChatGPT answers can be compared with the performance of medical students in the second half of their studies.",Friederichs H; Friederichs WJ; Marz M 40163112,Online Health Information-Seeking in the Era of Large Language Models: Cross-Sectional Web-Based Survey Study.,2025,Journal of medical Internet research,,,,"BACKGROUND: As large language model (LLM)-based chatbots such as ChatGPT (OpenAI) grow in popularity, it is essential to understand their role in delivering online health information compared to other resources. These chatbots often generate inaccurate content, posing potential safety risks. This motivates the need to examine how users perceive and act on health information provided by LLM-based chatbots. OBJECTIVE: This study investigates the patterns, perceptions, and actions of users seeking health information online, including LLM-based chatbots. The relationships between online health information-seeking behaviors and important sociodemographic characteristics are examined as well. METHODS: A web-based survey of crowd workers was conducted via Prolific. The questionnaire covered sociodemographic information, trust in health care providers, eHealth literacy, artificial intelligence (AI) attitudes, chronic health condition status, online health information source types, perceptions, and actions, such as cross-checking or adherence. Quantitative and qualitative analyses were applied. RESULTS: Most participants consulted search engines (291/297, 98%) and health-related websites (203/297, 68.4%) for their health information, while 21.2% (63/297) used LLM-based chatbots, with ChatGPT and Microsoft Copilot being the most popular. 
Most participants (268/297, 90.2%) sought information on health conditions, with fewer seeking advice on medication (179/297, 60.3%), treatments (137/297, 46.1%), and self-diagnosis (62/297, 23.2%). Perceived information quality and trust varied little across source types. The preferred source for validating information from the internet was consulting health care professionals (40/132, 30.3%), while only a very small percentage of participants (5/214, 2.3%) consulted AI tools to cross-check information from search engines and health-related websites. For information obtained from LLM-based chatbots, 19.4% (12/63) of participants cross-checked the information, while 48.4% (30/63) of participants followed the advice. Both of these rates were lower than information from search engines, health-related websites, forums, or social media. Furthermore, use of LLM-based chatbots for health information was negatively correlated with age (rho=-0.16, P=.006). In contrast, attitudes surrounding AI for medicine had significant positive correlations with the number of source types consulted for health advice (rho=0.14, P=.01), use of LLM-based chatbots for health information (rho=0.31, P<.001), and number of health topics searched (rho=0.19, P<.001). CONCLUSIONS: Although traditional online sources remain dominant, LLM-based chatbots are emerging as a resource for health information for some users, specifically those who are younger and have a higher trust in AI. The perceived quality and trustworthiness of health information varied little across source types. However, the adherence to health information from LLM-based chatbots seemed more cautious compared to search engines or health-related websites. 
As LLMs continue to evolve, enhancing their accuracy and transparency will be essential in mitigating any potential risks by supporting responsible information-seeking while maximizing the potential of AI in health contexts.",Yun HS; Bickmore T 37458761,Evaluating ChatGPT as an adjunct for the multidisciplinary tumor board decision-making in primary breast cancer cases.,2023,Archives of gynecology and obstetrics,,,,"BACKGROUND: As the available information about breast cancer is growing every day, the decision-making process for the therapy is getting more complex. ChatGPT, as a transformer-based language model, possesses the ability to write scientific articles and pass medical exams. But is it able to support the multidisciplinary tumor board (MDT) in planning the therapy of patients with breast cancer? MATERIAL AND METHODS: We performed a pilot study on 10 consecutive cases of breast cancer patients discussed in the MDT at our department in January 2023. Included were patients with a primary diagnosis of early breast cancer. The recommendation of the MDT was compared with the recommendation of ChatGPT for particular patients, and the clinical score of the agreement was calculated. RESULTS: ChatGPT provided mostly general answers regarding chemotherapy, breast surgery, radiation therapy, and antibody therapy. It was able to identify risk factors for hereditary breast cancer and to point out the elderly patient indicated for chemotherapy for evaluation of the cost/benefit effect. ChatGPT wrongly identified the patient with HER2 1+ and 2+ (FISH negative) as in need of antibody therapy and called endocrine therapy ""hormonal treatment"". CONCLUSIONS: As the amount of available information expands rapidly, clinical routine is searching for ways in which artificial intelligence can support the identification of individualized, personalized therapy for our patients. 
ChatGPT has the potential to find its spot in clinical medicine, but the current version is not able to provide specific recommendations for the therapy of patients with primary breast cancer.",Lukac S; Dayan D; Fink V; Leinert E; Hartkopf A; Veselinovic K; Janni W; Rack B; Pfister K; Heitmeir B; Ebner F 40257390,Artificial intelligence in asthma health literacy: a comparative analysis of ChatGPT versus Gemini.,2025,The Journal of asthma : official journal of the Association for the Care of Asthma,,,,"BACKGROUND: Asthma is a complex and heterogeneous chronic disease affecting over 300 million individuals worldwide. Despite advances in pharmacotherapy, poor disease control remains a major challenge, necessitating innovative approaches to patient education and self-management. Artificial intelligence driven chatbots, such as ChatGPT and Gemini, have the potential to enhance asthma care by providing real-time, evidence-based information. As asthma management moves toward personalized medicine, AI could support individualized education and treatment guidance. However, concerns remain regarding the accuracy and reliability of AI-generated medical content. OBJECTIVE: This study evaluated the accuracy of ChatGPT (version 4.0) and Gemini (version 1.2) in providing asthma-related health information using the Patient-completed Asthma Knowledge Questionnaire, a validated asthma literacy tool. METHODS: A cross-sectional study was conducted in which both AI models answered 54 standardized asthma-related items. Responses were classified as correct or incorrect based on alignment with validated clinical knowledge. Accuracy was assessed using descriptive statistics, Cohen's kappa for inter-model agreement, and chi-square tests for comparative performance. RESULTS: ChatGPT achieved an accuracy of 96.3% (52/54 correct; 95% CI: 87.5%-99.0%), while Gemini scored 92.6% (50/54 correct; 95% CI: 82.5%-97.1%), with no statistically significant difference (p = 0.67). 
Cohen's kappa demonstrated near-perfect agreement for ChatGPT (kappa = 0.91) and strong agreement for Gemini (kappa = 0.82). CONCLUSION: ChatGPT and Gemini demonstrated high accuracy in delivering asthma-related health information, supporting their potential as adjunct tools for patient education. AI models could potentially play a role in personalized asthma management by providing tailored treatment guidance and improving patient engagement.",Hoj S; Backer V; Ulrik CS; Sigsgaard T; Meteran H 40345326,Artificial Intelligence-generated answers to patients' questions on asthma: the AIR-Asthma study.,2025,The journal of allergy and clinical immunology. In practice,,,,"BACKGROUND: Asthma is a prevalent chronic respiratory disease requiring ongoing patient education and individualized management. The increasing reliance on digital tools, particularly generative artificial intelligence (AI), to answer health-related questions has raised concerns about the accuracy, reliability, and comprehensibility of AI-generated information for people living with asthma. OBJECTIVE: To systematically evaluate reliability, accuracy, comprehensiveness, and understandability of responses generated by three widely used AI-based chatbots (ChatGPT, Bard, Copilot) to common questions formulated by people with asthma. METHODS: In this cross-sectional study, 15 questions regarding asthma management were formulated by patients and categorized by difficulty. Responses from ChatGPT, Bard, and Copilot were evaluated by international experts for accuracy and comprehensiveness, and by patient representatives for understandability. Reliability was assessed through consistency testing across devices. A blinded evaluation was conducted. RESULTS: A total of 21 experts and 16 patient representatives participated in the evaluation. 
ChatGPT demonstrated the highest reliability (15/15 responses), accuracy (median score 9.0 [IQR 7.0-9.0]), and comprehensiveness (8.0 [8.0-9.0]) compared to Bard and Copilot (P < 0.0001). Bard achieved superior scores in understandability (median score 9.0 [8.0-10.0]) (P < 0.0001). Performance differences were consistent across question difficulty levels. CONCLUSION: AI-driven chatbots can provide generally accurate and understandable responses to asthma-related questions. Variability in reliability and accuracy underscores the need for caution in clinical contexts. AI tools may complement but cannot replace professional medical advice in asthma management.",Nigro M; Aliverti A; Angelucci A; Braido F; Canonica GW; Bossios A; Pinnock H; Boyd J; Powell P; Aliberti S 36819954,ChatGPT Output Regarding Compulsory Vaccination and COVID-19 Vaccine Conspiracy: A Descriptive Study at the Outset of a Paradigm Shift in Online Search for Information.,2023,Cureus,,,,"BACKGROUND: Being on the verge of a revolutionary approach to gathering information, ChatGPT (an artificial intelligence (AI)-based language model developed by OpenAI, and capable of producing human-like text) could be the prime motive of a paradigm shift in how humans will acquire information. Despite the concerns related to the use of such a promising tool in relation to the future of the quality of education, this technology will soon be incorporated into web search engines, mandating the need to evaluate the output of such a tool. Previous studies showed that dependence on some sources of online information (e.g., social media platforms) was associated with higher rates of vaccination hesitancy. Therefore, the aim of the current study was to describe the output of ChatGPT regarding coronavirus disease 2019 (COVID-19) vaccine conspiracy beliefs and compulsory vaccination. 
METHODS: The current descriptive study was conducted on January 14, 2023, using ChatGPT from OpenAI (OpenAI, L.L.C., San Francisco, CA, USA). The output was evaluated by two authors, and the degree of agreement regarding the correctness, clarity, conciseness, and bias was evaluated using Cohen's kappa. RESULTS: The ChatGPT responses were dismissive of conspiratorial ideas about severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) origins, labeling them as non-credible and lacking scientific evidence. Additionally, ChatGPT responses were totally against COVID-19 vaccine conspiracy statements. Regarding compulsory vaccination, ChatGPT responses were neutral, citing the following as advantages of this strategy: protecting public health, maintaining herd immunity, reducing the spread of disease, cost-effectiveness, and legal obligation, and on the other hand, it cited the following as disadvantages of compulsory vaccination: ethical and legal concerns, mistrust and resistance, logistical challenges, and limited resources and knowledge. CONCLUSIONS: The current study showed that ChatGPT could be a source of information to challenge COVID-19 vaccine conspiracies. For compulsory vaccination, ChatGPT resonated with the divided opinion in the scientific community toward such a strategy; nevertheless, it detailed the pros and cons of this approach. As it currently stands, ChatGPT could be judiciously utilized as a user-friendly source of COVID-19 vaccine information that could challenge conspiracy ideas with clear, concise, and non-biased content. 
However, ChatGPT content cannot be used as an alternative to the original reliable sources of vaccine information (e.g., the World Health Organization [WHO] and the Centers for Disease Control and Prevention [CDC]).",Sallam M; Salim NA; Al-Tammemi AB; Barakat M; Fayyad D; Hallit S; Harapan H; Hallit R; Mahafzah A 39265225,The practical use of artificial intelligence in Transfusion Medicine and Apheresis.,2024,Transfusion and apheresis science : official journal of the World Apheresis Association : official journal of the European Society for Haemapheresis,,,,"BACKGROUND: Blood and plasma volume calculations are a daily part of practice for many Transfusion Medicine and Apheresis practitioners. Though many formulas exist, each facility may have their own modifications to consider. ChatGPT (Generative Pre-trained Transformer) provides a new and exciting pathway for those with no programming experience to create personalized programs to meet the demands of daily practice. Additionally, this pathway creates computer programs that provide accurate and reproducible outputs. Herein, we aimed to create a step-by-step process for clinicians to create customized computer programs for use in everyday practice. METHODS: We created a process of inputs to ChatGPT-4(0), which generated computer programming code. This code was copied and pasted into Notepad (and saved as a Python file) and Google Colaboratory to verify functionality. We validated the durability of our process by repeating it over a 5-day timeframe and by recruiting volunteers to reproduce our outputs using the suggested process. RESULTS: Computer code generated by ChatGPT-4(0) in response to our common language inputs was accurate and durable over time. The code was fully functional in both Python and Colaboratory. Volunteers reproduced our process and outputs with minimal assistance. 
CONCLUSION: We analyzed the practical application of ChatGPT-4(0) and artificial intelligence (AI) to perform daily calculations encountered in Transfusion Medicine. Our results provide a proof of concept that people with no programming experience can create customizable solutions for their own facilities. Our future work will expand to the creation of comprehensive and customizable websites designed for each individual user.",Anstey C; Ullman D; Su L; Su C; Siniard C; Simmons S; Edberg J; Williams LA 3rd 40252296,AI-Assisted Blood Gas Interpretation: A Comparative Study With an Emergency Physician.,2025,The American journal of emergency medicine,,,,"BACKGROUND: Blood gas interpretation is critical in emergency settings. Large language models like ChatGPT are increasingly used in clinical contexts, but their accuracy in interpreting arterial blood gases (ABGs) requires further validation. OBJECTIVE: To evaluate ChatGPT's interpretive concordance with an emergency physician across 25 theoretical ABG scenarios. METHODS: ABG cases covering respiratory and metabolic emergencies (e.g., COPD, DKA, AKI, sepsis, poisoning) were analyzed by both ChatGPT and a specialist. Five interpretation criteria were used: pH, primary disorder, compensation, likely diagnosis, and clinical recommendation. RESULTS: Concordance was >/=90% in COPD, asthma, and pulmonary edema; 80-90% in DKA, AKI, and lactic acidosis; <70% in toxicologic and mixed acid-base cases. ChatGPT's recommendations were clinically safe even when diagnostic clarity was limited. CONCLUSION: ChatGPT shows high concordance with clinical interpretation in typical ABG cases but has limitations in complex or contextual diagnoses. 
These findings support its potential as a supportive tool in emergency medicine.",Gun M 38076046,ChatGPT-assisted deep learning for diagnosing bone metastasis in bone scans: Bridging the AI Gap for Clinicians.,2023,Heliyon,,,,"BACKGROUND: Bone scans are often used to identify bone metastases, but their low specificity may necessitate further studies. Deep learning models may improve diagnostic accuracy but require both medical and programming expertise. Therefore, we investigated the feasibility of constructing a deep learning model employing ChatGPT for the diagnosis of bone metastasis in bone scans and evaluated its diagnostic performance. METHOD: We examined 4626 consecutive cancer patients (age, 65.1 +/- 11.3 years; 2334 female) who had bone scans for metastasis assessment. A nuclear medicine physician developed a deep learning model using ChatGPT 3.5 (OpenAI). We employed ResNet50 as the backbone network and compared the diagnostic performance of four strategies (original training set, original training set with 1:10 class weight, 10-fold data augmentation for positive images only, and 10-fold data augmentation for all images) to address the class imbalance. We used a class activation map algorithm for visualization. RESULTS: Among the four strategies, the deep learning model with 10-fold data augmentation for positive cases only, using a batch size of 16 and an epoch size of 150, achieved an area under the curve of 0.8156, a sensitivity of 56.0%, and a specificity of 88.7%. The class activation map indicated that the model focused on disseminated bone metastases within the spine but might confuse them with benign spinal lesions or intense urinary activity. CONCLUSIONS: Our study illustrates that a clinical physician with rudimentary programming skills can develop a deep learning model for medical image analysis, such as diagnosing bone metastasis in bone scans using ChatGPT. 
Model visualization may offer guidance in enhancing deep learning model development, including preprocessing, and potentially support clinical decision-making processes.",Son HJ; Kim SJ; Pak S; Lee SH 37903939,Consulting the Digital Doctor: Google Versus ChatGPT as Sources of Information on Breast Implant-Associated Anaplastic Large Cell Lymphoma and Breast Implant Illness.,2024,Aesthetic plastic surgery,,,,"BACKGROUND: Breast implant-associated anaplastic large cell lymphoma (BIA-ALCL) is a rare complication associated with the use of breast implants. Breast implant illness (BII) is another potentially concerning issue related to breast implants. This study aims to assess the quality of ChatGPT as a potential source of patient education by comparing the answers to frequently asked questions on BIA-ALCL and BII provided by ChatGPT and Google. METHODS: The Google and ChatGPT answers to the 10 most frequently asked questions on the search terms ""breast implant associated anaplastic large cell lymphoma"" and ""breast implant illness"" were recorded. Five blinded breast plastic surgeons were then asked to grade the quality of the answers according to the Global Quality Score (GQS). A Wilcoxon paired t-test was performed to evaluate the difference in GQS ratings for Google and ChatGPT answers. The sources provided by Google and ChatGPT were also categorized and assessed. RESULTS: In a comparison of answers provided by Google and ChatGPT on BIA-ALCL and BII, ChatGPT significantly outperformed Google. For BIA-ALCL, Google's average score was 2.72 +/- 1.44, whereas ChatGPT scored an average of 4.18 +/- 1.04 (p < 0.01). For BII, Google's average score was 2.66 +/- 1.24, while ChatGPT scored an average of 4.28 +/- 0.97 (p < 0.01). The superiority of ChatGPT's responses was attributed to their comprehensive nature and recognition of existing knowledge gaps. However, some of ChatGPT's answers had inaccessible sources. 
CONCLUSION: ChatGPT outperforms Google in providing high-quality answers to commonly asked questions on BIA-ALCL and BII, highlighting the potential of AI technologies in patient education. LEVEL OF EVIDENCE: Level III, comparative study.",Liu HY; Alessandri Bonetti M; De Lorenzi F; Gimbel ML; Nguyen VT; Egro FM 37692649,The Use of AI in Diagnosing Diseases and Providing Management Plans: A Consultation on Cardiovascular Disorders With ChatGPT.,2023,Cureus,,,,"BACKGROUND: Cardiovascular diseases (CVDs) have remained the leading causes of death worldwide and substantially contribute to loss of health and excess health system costs. According to WHO, cardiovascular diseases (CVDs) take an estimated 17.9 million lives each year. One of the reasons for the immensely high fatality of CVDs is the lack of efficient diagnosis and prompt treatment. Timely recognition and management are crucial to minimize mortality. In the advancing world, AI (artificial intelligence) and machine learning technologies continue to progress; this advancement has opened new avenues for innovative approaches in the field of medicine. Despite the rapid development in the field of AI, there is a limited understanding of the potential benefits among clinicians and medical practitioners. METHODS: In this study, we aimed to investigate the potential that the AI language model holds to assist health practitioners in the diagnosis and treatment of cardiovascular disorders. We asked Chat Generative Pre-trained Transformer (ChatGPT) 10 hypothetical questions simulating clinical consultation. 
The responses given by ChatGPT were assessed for their accuracy and accessibility by a team of medical specialists and cardiologists with extensive experience in managing cardiovascular disorders. RESULTS: Out of the 10 clinical scenarios entered into ChatGPT, eight were perfectly diagnosed; however, the other two answers given by ChatGPT were not entirely incorrect, since those conditions were associated with the actual diagnosis. Furthermore, the management plans and the treatment protocols that were given by ChatGPT were in line with the literature and current medical knowledge. The exact drug names and regimens were not provided, but the general guideline given by this AI tool is definitely beneficial for junior doctors in getting an idea of how to proceed or refreshing their previous knowledge. CONCLUSION: ChatGPT is a valuable resource in the field of medicine. Its comprehensive and properly organized responses in understandable language have made it an effective and efficient tool. However, its limitations, such as the need for all associated and typical signs, symptoms, and physical examination findings, and its inability to personalize treatments, need to be acknowledged.",Rizwan A; Sadiq T 40396096,Discussion of the ability to use chatGPT to answer questions related to esophageal cancer of patient concern.,2025,Journal of family medicine and primary care,,,,"BACKGROUND: Chat Generative Pre-trained Transformer (ChatGPT) is a language processing model based on artificial intelligence (AI). It covers a wide range of topics, including medicine, and can provide patients with knowledge about esophageal cancer. OBJECTIVE: Based on its risk, this study aimed to assess ChatGPT's accuracy in answering patients' questions about esophageal cancer. 
METHODS: By referring to professional association websites, social software and the author's clinical experience, 55 questions of concern to Chinese patients and their families were generated and scored by two deputy chief physicians of esophageal cancer. The answers were: (1) comprehensive/correct, (2) incomplete/partially correct, (3) partially accurate, partially inaccurate, and (4) completely inaccurate/irrelevant. Score differences were resolved by a third reviewer. RESULTS: Out of 55 questions, 24 (43.6%) of the answers provided by ChatGPT were complete and correct, 13 (23.6%) were correct but incomplete, 18 (32.7%) were partially wrong, and no answers were completely wrong. Comprehensive and correct answers were highest in the field of prevention (50%), while partially incorrect answers were highest in the field of treatment (77.8%). CONCLUSION: ChatGPT can accurately answer the questions about the prevention and diagnosis of esophageal cancer, but it cannot accurately answer the questions about the treatment and prognosis of esophageal cancer. Further investigation and refinement of this widely used large-scale language model are needed before it can be recommended to patients with esophageal cancer, and ongoing research is still needed to verify the safety and accuracy of these tools and their medical applications.",Yu F; Lei M; Wang S; Liu M; Fu X; Yu Y 38854916,Surveyed veterinary students in Australia find ChatGPT practical and relevant while expressing no concern about artificial intelligence replacing veterinarians.,2024,Veterinary record open,,,,"BACKGROUND: Chat Generative Pre-trained Transformer (ChatGPT) is a freely available online artificial intelligence (AI) program capable of understanding and generating human-like language. This study assessed veterinary students' perceptions about ChatGPT in education and practice.
It compared perceptions about ChatGPT between students who had completed a critical analysis task and those who had not. METHODS: This cross-sectional study surveyed 498 Doctor of Veterinary Medicine (DVM) students at The University of Sydney, Australia. Second-year DVM students researched a veterinary pathogen and then completed a critical analysis of ChatGPT (version 3.5) output for the same pathogen. A survey based on the Technology Acceptance Model was then delivered to all DVM students from all years of the programme, collecting data using Likert-style, categorical and free-text items. RESULTS: Over 75% of the 100 respondents reported having used ChatGPT. The students found ChatGPT's output relevant and practical for their use but perceived it as inaccurate. They perceived ChatGPT output to be more useful for veterinary students than for pet owners or veterinarians. Those who had completed the critical analysis assignment had a more positive view of ChatGPT's practicality for veterinary students but noted its authoritative tone even when delivering inaccurate information. Over 50% of the students agreed that information about tools such as ChatGPT should be included in the veterinary curriculum. Students agreed that veterinarians should embrace AI but disagreed that AI would eventually replace the need for veterinarians. CONCLUSIONS: A critical appraisal of outputs from AI tools such as ChatGPT may help prepare future veterinarians for the effective use of these tools.",Worthing KA; Roberts M; Slapeta J 38589561,Blepharoptosis Consultation with Artificial Intelligence: Aesthetic Surgery Advice and Counseling from Chat Generative Pre-Trained Transformer (ChatGPT).,2024,Aesthetic plastic surgery,,,,"BACKGROUND: Chat generative pre-trained transformer (ChatGPT) is a publicly available extensive artificial intelligence (AI) language model that leverages deep learning to generate text that mimics human conversations. 
In this study, the performance of ChatGPT was assessed by offering insightful and precise answers to a series of fictional questions and emulating a preliminary consultation on blepharoplasty. METHODS: ChatGPT was posed with questions derived from a blepharoplasty checklist provided by the American Society of Plastic Surgeons. Board-certified plastic surgeons and non-medical staff members evaluated the responses for accuracy, informativeness, and accessibility. RESULTS: Nine questions were used in this study. Regarding informativeness, the average score given by board-certified plastic surgeons was significantly lower than that given by non-medical staff members (2.89 +/- 0.72 vs 4.41 +/- 0.71; p = 0.042). No statistically significant differences were observed in accuracy (p = 0.56) or accessibility (p = 0.11). CONCLUSIONS: Our results emphasize the effectiveness of ChatGPT in simulating doctor-patient conversations during blepharoplasty. Non-medical individuals found its responses more informative compared with the surgeons. Although limited in terms of specialized guidance, ChatGPT offers foundational surgical information. Further exploration is warranted to elucidate the broader role of AI in esthetic surgical consultations. LEVEL OF EVIDENCE V: Observational study under respected authorities. This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Shiraishi M; Tanigawa K; Tomioka Y; Miyakuni A; Moriwaki Y; Yang R; Oba J; Okazaki M 39473997,Can ChatGPT Be Used as a Research Assistant and a Patient Consultant in Plastic Surgery? A Review of 3 Key Information Domains.,2024,Eplasty,,,,"BACKGROUND: Chat Generative Pretrained Transformer (ChatGPT), a newly developed pretrained artificial intelligence (AI) chatbot, is able to interpret and respond to user-generated questions. 
As such, many questions have been raised about its potential uses and limitations. While preliminary literature suggests that ChatGPT can be used in medicine as a research assistant and patient consultant, its reliability in providing original and accurate information is still unknown. Therefore, the purpose of this project was to conduct a review on the utility of ChatGPT in plastic surgery. METHODS: On August 25, 2023, a thorough literature search was conducted on PubMed. Papers involving ChatGPT and medical research were included. Papers that were not written in English were excluded. Related papers were evaluated and synthesized into 3 information domains: generating original research topics, summarizing and extracting information from medical literature and databases, and conducting patient consultation. RESULTS: Out of 57 initial papers, 8 met inclusion criteria. An additional 2 were added based on the references of relevant papers, bringing the total number to 10. ChatGPT can be useful in helping clinicians brainstorm and gain a general understanding of the literature landscape. However, its inability to give patient-specific information and act as a reliable source of information limit its use in patient consultation. CONCLUSION: ChatGPT can be a useful tool in the conception of and execution of literature searches and research information retrieval (with increased reliability when queries are specific); however, the technology is currently not reliable enough to be implemented in a clinical setting.",Campolo JA; Kwon DY; Henderson PW 37294147,ChatGPT failed Taiwan's Family Medicine Board Exam.,2023,Journal of the Chinese Medical Association : JCMA,,,,"BACKGROUND: Chat Generative Pre-trained Transformer (ChatGPT), OpenAI Limited Partnership, San Francisco, CA, USA is an artificial intelligence language model gaining popularity because of its large database and ability to interpret and respond to various queries. 
Although it has been tested by researchers in different fields, its performance varies depending on the domain. We aimed to further test its ability in the medical field. METHODS: We used questions from Taiwan's 2022 Family Medicine Board Exam, which combined both Chinese and English and covered various question types, including reverse questions and multiple-choice questions, and mainly focused on general medical knowledge. We pasted each question into ChatGPT and recorded its response, comparing it to the correct answer provided by the exam board. We used SAS 9.4 (Cary, North Carolina, USA) and Excel to calculate the accuracy rates for each question type. RESULTS: ChatGPT answered 52 questions out of 125 correctly, with an accuracy rate of 41.6%. The questions' length did not affect the accuracy rates. These were 45.5%, 33.3%, 58.3%, 50.0%, and 43.5% for negative-phrase questions, multiple-choice questions, mutually exclusive options, case scenario questions, and Taiwan's local policy-related questions, with no statistical difference observed. CONCLUSION: ChatGPT's accuracy rate was not good enough for Taiwan's Family Medicine Board Exam. Possible reasons include the difficulty level of the specialist exam and the relatively weak database of traditional Chinese language resources. However, ChatGPT performed acceptably in negative-phrase questions, mutually exclusive questions, and case scenario questions, and it can be a helpful tool for learning and exam preparation. 
Future research can explore ways to improve ChatGPT's accuracy rate for specialized exams and other domains.",Weng TL; Wang YM; Chang S; Chen TJ; Hwang SJ 37822477,Perception of Chat Generative Pre-trained Transformer (Chat-GPT) AI tool amongst MSK clinicians.,2023,Journal of clinical orthopaedics and trauma,,,,"BACKGROUND: Chat Generative Pre-trained Transformer (ChatGPT); an open access artificial intelligence (AI) tool has been in the limelight with its ability to respond to prompts, analyse data information using algorithms to augment efficiency in day-to-day activities across a spectrum of human activities including MSK/Orthopaedic science. PURPOSE OF THE STUDY: The purpose of this cross-sectional survey has been to analyse the knowledge, understanding of the role of Chat Generative Pre-trained Transformer (ChatGPT) and its implications in clinical practice as well as research in medicine. MATERIAL & METHODS: An online cross-sectional survey of 10 questions (multiple choice and free text) was circulated amongst orthopaedic surgeons, musculoskeletal radiologists and Rheumatologists in India and UK, to evaluate perception of Chat Generative Pre-trained Transformer (ChatGPT) AI Tool. RESULTS: We had 125 responses with majority being aware of ChatGPT though a minority had used it. There was consensus that its going have detrimental effect on workforce with majority of the opinion that they would be used to create radiology reports. Mixed responses were noted regarding the quality of research and role of ChatGPT being an anonymous author. CONCLUSION: There is a considerable debate amongst clinicians of orthopaedic, radiology and rheumatology -specialities. The attitudes are mixed but mainly positive, although there are many concerns about the still-evolving new technology. LEVEL OF STUDY: Diagnostic Study level 4.",Iyengar KP; Yousef MMA; Nune A; Sharma GK; Botchu R 39536965,How soon will surgeons become mere technicians? 
Chatbot performance in managing clinical scenarios.,2024,The Journal of thoracic and cardiovascular surgery,,,,"BACKGROUND: Chatbot use has developed a presence in medicine and surgery and has been proposed to help guide clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been called into question. We evaluated the performance of 4 popular chatbots on a board-style examination and compared results with a group of board-certified thoracic surgeons. METHODS: Clinical scenarios were developed within domains based on the American Board of Thoracic Surgery (ABTS) Qualifying Exam. Each scenario included 3 stems written with the Key Feature methodology related to diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as to randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were identified during exam development; if they occurred in any of the 3 stems the entire question received a score of 0. The Mann-Whitney U test was used to compare surgeon scores and chatbot scores. RESULTS: Examinations were completed by 21 surgeons, the majority of whom (n = 14; 66%) practiced in academic or university settings. The median score per scenario was 1.06 for chatbots, compared to 1.88 for surgeons (difference, 0.66; P = .019). Surgeon median scores were better than chatbot median scores for all except 2 scenarios. Chatbot answers were significantly more likely to be deemed critical failures compared to those provided by surgeons (median, 0.50 per chatbot/scenario vs 0.19 per surgeon/scenario; P = .016). CONCLUSIONS: Four popular chatbots performed at a significantly lower level than board-certified surgeons. 
Implementation of AI should be undertaken with caution in clinical decision making.",Bryan DS; Platz JJ; Naunheim KS; Ferguson MK 38574939,Chatbot Reliability in Managing Thoracic Surgical Clinical Scenarios.,2024,The Annals of thoracic surgery,,,,"BACKGROUND: Chatbot use in medicine is growing, and concerns have been raised regarding their accuracy. This study assessed the performance of 4 different chatbots in managing thoracic surgical clinical scenarios. METHODS: Topic domains were identified and clinical scenarios were developed within each domain. Each scenario included 3 stems using Key Feature methods related to diagnosis, evaluation, and treatment. Twelve scenarios were presented to ChatGPT-4 (OpenAI), Bard (recently renamed Gemini; Google), Perplexity (Perplexity AI), and Claude 2 (Anthropic) in 3 separate runs. Up to 1 point was awarded for each stem, yielding a potential of 3 points per scenario. Critical failures were identified before scoring; if they occurred, the stem and overall scenario scores were adjusted to 0. We arbitrarily established a threshold of >/=2 points mean adjusted score per scenario as a passing grade and established a critical fail rate of >/=30% as failure to pass. RESULTS: The bot performances varied considerably within each run, and their overall performance was a fail on all runs (critical mean scenario fails of 83%, 71%, and 71%). The bots trended toward ""learning"" from the first to the second run, but without improvement in overall raw (1.24 +/- 0.47 vs 1.63 +/- 0.76 vs 1.51 +/- 0.60; P = .29) and adjusted (0.44 +/- 0.54 vs 0.80 +/- 0.94 vs 0.76 +/- 0.81; P = .48) scenario scores after all runs. CONCLUSIONS: Chatbot performance in managing clinical scenarios was insufficient to provide reliable assistance. 
This is a cautionary note against reliance on the current accuracy of chatbots in complex thoracic surgery medical decision making.",Platz JJ; Bryan DS; Naunheim KS; Ferguson MK 39132729,Performance of AI-powered chatbots in diagnosing acute pulmonary thromboembolism from given clinical vignettes.,2024,Acute medicine,,,,"BACKGROUND: Chatbots hold great potential to serve as support tool in diagnosis and clinical decision process. In this study, we aimed to evaluate the accuracy of chatbots in diagnosing pulmonary embolism (PE). Furthermore, we assessed their performance in determining the PE severity. METHOD: 65 case reports meeting our inclusion criteria were selected for this study. Two emergency medicine (EM) physicians crafted clinical vignettes and introduced them to the Bard, Bing, and ChatGPT-3.5 with asking the top 10 diagnoses. After obtaining all differential diagnoses lists, vignettes enriched with supplemental data redirected to the chatbots with asking the severity of PE. RESULTS: ChatGPT-3.5, Bing, and Bard listed PE within the top 10 diagnoses list with accuracy rates of 92.3%, 92.3%, and 87.6%, respectively. For the top 3 diagnoses, Bard achieved 75.4% accuracy, while ChatGPT and Bing both had 67.7%. As the top diagnosis, Bard, ChatGPT-3.5, and Bing were accurate in 56.9%, 47.7% and 30.8% cases, respectively. Significant differences between Bard and both Bing (p=0.000) and ChatGPT (p=0.007) were noted in this group. Massive PEs were correctly identified with over 85% success rate. Overclassification rates for Bard, ChatGPT-3.5 and Bing at 38.5%, 23.3% and 20%, respectively. Misclassification rates were highest in submassive group. CONCLUSION: Although chatbots aren't intended for diagnosis, their high level of diagnostic accuracy and success rate in identifying massive PE underscore the promising potential of chatbots as clinical decision support tool. 
However, further research with larger patient datasets is required to validate and refine their performance in real-world clinical settings.",Arslan B; Sutasir MN; Altinbilek E 37590034,Using ChatGPT as a Learning Tool in Acupuncture Education: Comparative Study.,2023,JMIR medical education,,,,"BACKGROUND: ChatGPT (Open AI) is a state-of-the-art artificial intelligence model with potential applications in the medical fields of clinical practice, research, and education. OBJECTIVE: This study aimed to evaluate the potential of ChatGPT as an educational tool in college acupuncture programs, focusing on its ability to support students in learning acupuncture point selection, treatment planning, and decision-making. METHODS: We collected case studies published in Acupuncture in Medicine between June 2022 and May 2023. Both ChatGPT-3.5 and ChatGPT-4 were used to generate suggestions for acupuncture points based on case presentations. A Wilcoxon signed-rank test was conducted to compare the number of acupuncture points generated by ChatGPT-3.5 and ChatGPT-4, and the overlapping ratio of acupuncture points was calculated. RESULTS: Among the 21 case studies, 14 studies were included for analysis. ChatGPT-4 generated significantly more acupuncture points (9.0, SD 1.1) compared to ChatGPT-3.5 (5.6, SD 0.6; P<.001). The overlapping ratios of acupuncture points for ChatGPT-3.5 (0.40, SD 0.28) and ChatGPT-4 (0.34, SD 0.27; P=.67) were not significantly different. CONCLUSIONS: ChatGPT may be a useful educational tool for acupuncture students, providing valuable insights into personalized treatment plans. 
However, it cannot fully replace traditional diagnostic methods, and further studies are needed to ensure its safe and effective implementation in acupuncture education.",Lee H 39284182,Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study.,2024,JMIR medical education,,,,"BACKGROUND: ChatGPT has been tested in health care, including the US Medical Licensing Examination and specialty exams, showing near-passing results. Its performance in the field of anesthesiology has been assessed using English board examination questions; however, its effectiveness in Korea remains unexplored. OBJECTIVE: This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education. METHODS: We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using in-training examinations that have been administered to Korean anesthesiology residents over the past 5 years, with an annual composition of 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess the performance differences of the GPT across different languages, we conducted a comparative analysis of the GPT-4's problem-solving proficiency using both the original Korean texts and their English translations. RESULTS: A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated a significantly better overall performance than GPT-3.5 (37.2%) and CLOVA-X (36.7%). However, GPT-3.5 and CLOVA X did not show significant differences in their overall performance. 
Additionally, GPT-4 showed superior performance on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001). CONCLUSIONS: This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings.",Yoon SH; Oh SK; Lim BG; Lee HJ 37812998,Appraising the performance of ChatGPT in psychiatry using 100 clinical case vignettes.,2023,Asian journal of psychiatry,,,,"BACKGROUND: ChatGPT has emerged as the most advanced and rapidly developing large language chatbot system. With its immense potential ranging from answering a simple query to cracking highly competitive medical exams, ChatGPT continues to impress the scientists and researchers worldwide, giving room for more discussions regarding its utility in various fields. One such field of attention is Psychiatry. With suboptimal diagnosis and treatment, assuring mental health and well-being is a challenge in many countries, particularly developing nations. In this regard, we conducted an evaluation to assess the performance of ChatGPT 3.5 in Psychiatry using clinical cases to provide evidence-based information regarding the implication of ChatGPT 3.5 in enhancing mental health and well-being. METHODS: ChatGPT 3.5 was used in this experimental study to initiate the conversations and collect responses to clinical vignettes in Psychiatry. Using 100 clinical case vignettes, the replies were assessed by expert faculty from the Department of Psychiatry. There were 100 different psychiatric illnesses represented in the cases. We recorded and assessed the initial ChatGPT 3.5 responses.
The evaluation was conducted using the questions posed at the conclusion of each case, and the aims of the questions were divided into 10 categories. The grading was completed by taking the mean value of the scores provided by the evaluators. Graphs and tables were used to represent the grades. RESULTS: The evaluation report suggests that ChatGPT 3.5 fared extremely well in Psychiatry by receiving ""Grade A"" ratings in 61 out of 100 cases, ""Grade B"" ratings in 31, and ""Grade C"" ratings in 8. The majority of the queries were concerned with management strategies, which were followed by diagnosis, differential diagnosis, assessment, investigation, counselling, clinical reasoning, ethical reasoning, prognosis, and request acceptance. ChatGPT 3.5 performed extremely well, especially in generating management strategies followed by diagnoses for different psychiatric conditions. There were no responses which were graded ""D"", indicating that there were no errors in the diagnosis or response for clinical care. Only a few discrepancies and additional details were missed in a few responses that received a ""Grade C"". CONCLUSION: It is evident from our study that ChatGPT 3.5 has appreciable knowledge and interpretation skills in Psychiatry. Thus, ChatGPT 3.5 undoubtedly has the potential to transform the field of Medicine, and we emphasize its utility in Psychiatry through the findings of our study. However, for any AI model to be successful, assuring the reliability and validation of information, proper guidelines, and an implementation framework are necessary.",Franco D'Souza R; Amanullah S; Mathew M; Surapaneni KM 39314814,Exploring community pharmacists' attitudes in Thailand towards ChatGPT usage: A pilot qualitative investigation.,2024,Digital health,,,,"BACKGROUND: ChatGPT has recently emerged as a disruptive technology, potentially impacting various societal dimensions, including pharmacy practices.
In Thailand, community pharmacists are navigating transitions as patients increasingly rely on digital tools for healthcare recommendations. This study explores the attitudes of community pharmacists in Hatyai, one of Thailand's most populated cities, towards the integration of ChatGPT in pharmacy services. METHOD: ChatGPT-3.5 was used to generate responses to three questions concerning the use of medicine in special populations in the Thai language. These responses were then incorporated into a questionnaire and evaluated using a Likert scale from 1 to 5. Participants who consented were asked to rate the responses and participate in an in-depth interview. RESULTS: The majority of participants rated the responses favorably, with scores of 4 and 5 accounting for at least 60% of the ratings. Only a small proportion of responses received doubtful ratings (score of 3) or was in disagreement, ranging from 20% to 40%. Moreover, open opinions extracted from the interviews suggested that participants viewed ChatGPT as a capable assistant, as it provided fast yet reasonably accurate information in the Thai language. CONCLUSION: The findings indicate that community pharmacists view ChatGPT as a capable assistant, albeit noting the need for further refinements. The study underscores the importance for pharmacists to proactively adapt to technological advancements, particularly those affecting patient safety, to enhance healthcare delivery and optimize treatment outcomes.",Boonrit N; Chaisawat K; Phueakong C; Nootong N; Ruanglertboon W 38729608,Unlocking the future of patient Education: ChatGPT vs. LexiComp(R) as sources of patient education materials.,2025,Journal of the American Pharmacists Association : JAPhA,,,,"BACKGROUND: ChatGPT is a conversational artificial intelligence technology that has shown application in various facets of healthcare. With the increased use of AI, it is imperative to assess the accuracy and comprehensibility of AI platforms. 
OBJECTIVE: This pilot project aimed to assess the understandability, readability, and accuracy of ChatGPT as a source of medication-related patient education as compared with an evidence-based medicine tertiary reference resource, LexiComp(R). METHODS: Patient education materials (PEMs) were obtained from ChatGPT and LexiComp(R) for 8 common medications (albuterol, apixaban, atorvastatin, hydrocodone/acetaminophen, insulin glargine, levofloxacin, omeprazole, and sacubitril/valsartan). PEMs were extracted, blinded, and assessed by 2 investigators independently. The primary outcome was a comparison of the Patient Education Materials Assessment Tool-printable (PEMAT-P). Secondary outcomes included Flesch reading ease, Flesch Kincaid grade level, percent passive sentences, word count, and accuracy. A 7-item accuracy checklist for each medication was generated by expert consensus among pharmacist investigators, with LexiComp(R) PEMs serving as the control. PEMAT-P interrater reliability was determined via intraclass correlation coefficient (ICC). Flesch reading ease, Flesch Kincaid grade level, percent passive sentences, and word count were calculated by Microsoft(R) Word(R). Continuous data were assessed using the Student's t-test via SPSS (version 20.0). RESULTS: No difference was found in the PEMAT-P understandability score of PEMs produced by ChatGPT versus LexiComp(R) [77.9% (11.0) vs. 72.5% (2.4), P=0.193]. Reading level was higher with ChatGPT [8.6 (1.2) vs. 5.6 (0.3), P < 0.001). ChatGPT PEMs had a lower percentage of passive sentences and lower word count. The average accuracy score of ChatGPT PEMs was 4.25/7 (61%), with scores ranging from 29% to 86%. CONCLUSION: Despite comparable PEMAT-P scores, ChatGPT PEMs did not meet grade level targets. 
Lower word count and passive text with ChatGPT PEMs could benefit patients, but the variable accuracy scores prevent routine use of ChatGPT to produce medication-related PEMs at this time.",Covington EW; Watts Alexander CS; Sewell J; Hutchison AM; Kay J; Tocco L; Hyte M 38684536,Performance of ChatGPT in Answering Clinical Questions on the Practical Guideline of Blepharoptosis.,2024,Aesthetic plastic surgery,,,,"BACKGROUND: ChatGPT is a free artificial intelligence (AI) language model developed and released by OpenAI in late 2022. This study aimed to evaluate the performance of ChatGPT to accurately answer clinical questions (CQs) on the Guideline for the Management of Blepharoptosis published by the American Society of Plastic Surgeons (ASPS) in 2022. METHODS: CQs in the guideline were used as question sources in both English and Japanese. For each question, ChatGPT provided answers for CQs, evidence quality, recommendation strength, reference match, and answered word counts. We compared the performance of ChatGPT in each component between English and Japanese queries. RESULTS: A total of 11 questions were included in the final analysis, and ChatGPT answered 61.3% of these correctly. ChatGPT demonstrated a higher accuracy rate in English answers for CQs compared to Japanese answers for CQs (76.4% versus 46.4%; p = 0.004) and word counts (123 words versus 35.9 words; p = 0.004). No statistical differences were noted for evidence quality, recommendation strength, and reference match. A total of 697 references were proposed, but only 216 of them (31.0%) existed. CONCLUSIONS: ChatGPT demonstrates potential as an adjunctive tool in the management of blepharoptosis. However, it is crucial to recognize that the existing AI model has distinct limitations, and its primary role should be to complement the expertise of medical professionals. LEVEL OF EVIDENCE V: Observational study under respected authorities. 
This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Shiraishi M; Tomioka Y; Miyakuni A; Ishii S; Hori A; Park H; Ohba J; Okazaki M 37634667,Chat GPT as a Neuro-Score Calculator: Analysis of a Large Language Model's Performance on Various Neurological Exam Grading Scales.,2023,World neurosurgery,,,,"BACKGROUND: ChatGPT is a large language model artificial intelligence chatbot that has been applied to different aspects of the medical field. Our study aims to assess the quality of chatGPT to evaluate patients based on their exams for different scores including Glasgow Coma Scale (GCS), intracranial hemorrhage score (ICH), and Hunt & Hess (H&H) classification. METHODS: We created batches of patient test cases with detailed neurological exams, totaling 20 cases and created variants of increasing complex phrasing of the test cases. Using ChatGPT, we assessed repeatability and quantified the errors, including the average error rate (AER) and magnitude of errors (AME). We repeated this process for the H&H and the ICH score using base cases. Specific prompts were created for each calculator. RESULTS: The GCS calculator on 10 base test cases had an AER/AME of 10%/0.150. The accuracy of ChatGPT decreased with increasing complexity; for example, in a variation where crucial information was missing, the AER was 45% for 20 cases. For H&H, AER/AME was 13%/0.13 and for ICH, AER/AME was 27.5%/0.325. Using a simple prompt resulted in a significantly higher error rate of 70%. CONCLUSIONS: ChatGPT demonstrates ability in this proof-of-concept experiment in evaluating neuroexams using established assessment scales including GCS, ICH, and H&H. However, it has limitations in accuracy and may ""hallucinate"" with complex or vague descriptions. 
Nonetheless, ChatGPT has promising potential in medicine.",Chen TC; Kaminski E; Koduri L; Singer A; Singer J; Couldwell M; Delashaw J; Dumont A; Wang A 38573944,Performance of ChatGPT on Chinese Master's Degree Entrance Examination in Clinical Medicine.,2024,PloS one,,,,"BACKGROUND: ChatGPT is a large language model designed to generate responses based on a contextual understanding of user queries and requests. This study utilised the entrance examination for the Master of Clinical Medicine in Traditional Chinese Medicine to assess the reliability and practicality of ChatGPT within the domain of medical education. METHODS: We selected 330 single and multiple-choice questions from the 2021 and 2022 Chinese Master of Clinical Medicine comprehensive examinations, which did not include any images or tables. To ensure the test's accuracy and authenticity, we preserved the original format of the query and alternative test texts, without any modifications or explanations. RESULTS: Both ChatGPT3.5 and GPT-4 attained average scores surpassing the admission threshold. Noteworthy is that ChatGPT achieved the highest score in the Medical Humanities section, boasting a correct rate of 93.75%. However, it is worth noting that ChatGPT3.5 exhibited the lowest accuracy percentage of 37.5% in the Pathology division, while GPT-4 also displayed a relatively lower correctness percentage of 60.23% in the Biochemistry section. An analysis of sub-questions revealed that ChatGPT demonstrates superior performance in handling single-choice questions but performs poorly in multiple-choice questions. CONCLUSION: ChatGPT exhibits a degree of medical knowledge and the capacity to aid in diagnosing and treating diseases. Nevertheless, enhancements are warranted to address its accuracy and reliability limitations.
Rigorous evaluation and oversight must accompany its utilization, along with proactive measures to surmount prevailing constraints.",Li KC; Bu ZJ; Shahjalal M; He BX; Zhuang ZF; Li C; Liu JP; Wang B; Liu ZL 37548997,Performance of ChatGPT on the Situational Judgement Test-A Professional Dilemmas-Based Examination for Doctors in the United Kingdom.,2023,JMIR medical education,,,,"BACKGROUND: ChatGPT is a large language model that has performed well on professional examinations in the fields of medicine, law, and business. However, it is unclear how ChatGPT would perform on an examination assessing professionalism and situational judgement for doctors. OBJECTIVE: We evaluated the performance of ChatGPT on the Situational Judgement Test (SJT): a national examination taken by all final-year medical students in the United Kingdom. This examination is designed to assess attributes such as communication, teamwork, patient safety, prioritization skills, professionalism, and ethics. METHODS: All questions from the UK Foundation Programme Office's (UKFPO's) 2023 SJT practice examination were entered into ChatGPT. For each question, ChatGPT's answers and rationales were recorded and assessed on the basis of the official UKFPO scoring template. Questions were categorized into domains of Good Medical Practice on the basis of the domains referenced in the rationales provided in the scoring sheet. Questions without clear domain links were screened by reviewers and assigned one or multiple domains. ChatGPT's overall performance, as well as its performance across the domains of Good Medical Practice, was evaluated. RESULTS: Overall, ChatGPT performed well, scoring 76% on the SJT but scoring full marks on only a few questions (9%), which may reflect possible flaws in ChatGPT's situational judgement, inconsistencies in the reasoning across questions in the examination itself, or both. 
ChatGPT demonstrated consistent performance across the 4 outlined domains in Good Medical Practice for doctors. CONCLUSIONS: Further research is needed to understand the potential applications of large language models, such as ChatGPT, in medical education for standardizing questions and providing consistent rationales for examinations assessing professionalism and ethics.",Borchert RJ; Hickman CR; Pepys J; Sadler TJ 40354644,Global Health care Professionals' Perceptions of Large Language Model Use In Practice: Cross-Sectional Survey Study.,2025,JMIR medical education,,,,"BACKGROUND: ChatGPT is a large language model-based chatbot developed by OpenAI. ChatGPT has many potential applications to health care, including enhanced diagnostic accuracy and efficiency, improved treatment planning, and better patient outcomes. However, health care professionals' perceptions of ChatGPT and similar artificial intelligence tools are not well known. Understanding these attitudes is important to inform the best approaches to exploring their use in medicine. OBJECTIVE: Our aim was to evaluate the health care professionals' awareness and perceptions regarding potential applications of ChatGPT in the medical field, including potential benefits and challenges of adoption. METHODS: We designed a 33-question online survey that was distributed among health care professionals via targeted emails and professional Twitter and LinkedIn accounts. The survey included a range of questions to define respondents' demographic characteristics, familiarity with ChatGPT, perceptions of this tool's usefulness and reliability, and opinions on its potential to improve patient care, research, and education efforts. RESULTS: One hundred and fifteen health care professionals from 21 countries responded to the survey, including physicians, nurses, researchers, and educators. Of these, 101 (87.8%) had heard of ChatGPT, mainly from peers, social media, and news, and 77 (76.2%) had used ChatGPT at least once. 
Participants found ChatGPT to be helpful for writing manuscripts (n=31, 45.6%), emails (n=25, 36.8%), and grants (n=12, 17.6%); accessing the latest research and evidence-based guidelines (n=21, 30.9%); providing suggestions on diagnosis or treatment (n=15, 22.1%); and improving patient communication (n=12, 17.6%). Respondents also felt that the ability of ChatGPT to access and summarize research articles (n=22, 46.8%), provide quick answers to clinical questions (n=15, 31.9%), and generate patient education materials (n=10, 21.3%) was helpful. However, there are concerns regarding the use of ChatGPT, for example, the accuracy of responses (n=14, 29.8%), limited applicability in specific practices (n=18, 38.3%), and legal and ethical considerations (n=6, 12.8%), mainly related to plagiarism or copyright violations. Participants stated that safety protocols such as data encryption (n=63, 62.4%) and access control (n=52, 51.5%) could assist in ensuring patient privacy and data security. CONCLUSIONS: Our findings show that ChatGPT use is widespread among health care professionals in daily clinical, research, and educational activities. The majority of our participants found ChatGPT to be useful; however, there are concerns about patient privacy, data security, and its legal and ethical issues as well as the accuracy of its information. Further studies are required to understand the impact of ChatGPT and other large language models on clinical, educational, and research outcomes, and the concerns regarding its use must be addressed systematically and through appropriate methods.",Ozkan E; Tekin A; Ozkan MC; Cabrera D; Niven A; Dong Y 37851495,Health Care Trainees' and Professionals' Perceptions of ChatGPT in Improving Medical Knowledge Training: Rapid Survey Study.,2023,Journal of medical Internet research,,,,"BACKGROUND: ChatGPT is a powerful pretrained large language model. 
It has both demonstrated potential and raised concerns related to knowledge translation and knowledge transfer. To apply and improve knowledge transfer in the real world, it is essential to assess the perceptions and acceptance of the users of ChatGPT-assisted training. OBJECTIVE: We aimed to investigate the perceptions of health care trainees and professionals on ChatGPT-assisted training, using biomedical informatics as an example. METHODS: We used purposeful sampling to include all health care undergraduate trainees and graduate professionals (n=195) from January to May 2023 in the School of Public Health at the National Defense Medical Center in Taiwan. Subjects were asked to watch a 2-minute video introducing 5 scenarios about ChatGPT-assisted training in biomedical informatics and then answer a self-designed online (web- and mobile-based) questionnaire according to the Kirkpatrick model. The survey responses were used to develop 4 constructs: ""perceived knowledge acquisition,"" ""perceived training motivation,"" ""perceived training satisfaction,"" and ""perceived training effectiveness."" The study used structural equation modeling (SEM) to evaluate and test the structural model and hypotheses. RESULTS: The online questionnaire response rate was 152 of 195 (78%); 88 of 152 participants (58%) were undergraduate trainees and 90 of 152 participants (59%) were women. The ages ranged from 18 to 53 years (mean 23.3, SD 6.0 years). There was no statistical difference in perceptions of training evaluation between men and women. Most participants were enthusiastic about the ChatGPT-assisted training, while the graduate professionals were more enthusiastic than undergraduate trainees. Nevertheless, some concerns were raised about potential cheating on training assessment. 
The average scores for knowledge acquisition, training motivation, training satisfaction, and training effectiveness were 3.84 (SD 0.80), 3.76 (SD 0.93), 3.75 (SD 0.87), and 3.72 (SD 0.91), respectively (Likert scale 1-5: strongly disagree to strongly agree). Knowledge acquisition had the highest score and training effectiveness the lowest. In the SEM results, training effectiveness was influenced predominantly by knowledge acquisition and partially met the hypotheses in the research framework. Knowledge acquisition had a direct effect on training effectiveness, training satisfaction, and training motivation, with beta coefficients of .80, .87, and .97, respectively (all P<.001). CONCLUSIONS: Most health care trainees and professionals perceived ChatGPT-assisted training as an aid in knowledge transfer. However, to improve training effectiveness, it should be combined with empirical experts for proper guidance and dual interaction. In a future study, we recommend using a larger sample size for evaluation of internet-connected large language models in medical knowledge transfer.",Hu JM; Liu FC; Chu CM; Chang YT 38912370,ChatGPT Is Moderately Accurate in Providing a General Overview of Orthopaedic Conditions.,2024,JB & JS open access,,,,"BACKGROUND: ChatGPT is an artificial intelligence chatbot capable of providing human-like responses for virtually every possible inquiry. This advancement has provoked public interest regarding the use of ChatGPT, including in health care. The purpose of the present study was to investigate the quantity and accuracy of ChatGPT outputs for general patient-focused inquiries regarding 40 orthopaedic conditions. METHODS: For each of the 40 conditions, ChatGPT (GPT-3.5) was prompted with the text ""I have been diagnosed with [condition]. 
Can you tell me more about it?"" The numbers of treatment options, risk factors, and symptoms given for each condition were compared with the number in the corresponding American Academy of Orthopaedic Surgeons (AAOS) OrthoInfo website article for information quantity assessment. For accuracy assessment, an attending orthopaedic surgeon ranked the outputs in the categories of <50%, 50% to 74%, 75% to 99%, and 100% accurate. An orthopaedics sports medicine fellow also independently ranked output accuracy. RESULTS: Compared with the AAOS OrthoInfo website, ChatGPT provided significantly fewer treatment options (mean difference, -2.5; p < 0.001) and risk factors (mean difference, -1.1; p = 0.02) but did not differ in the number of symptoms given (mean difference, -0.5; p = 0.31). The surgical treatment options given by ChatGPT were often nondescript (n = 20 outputs), such as ""surgery"" as the only operative treatment option. Regarding accuracy, most conditions (26 of 40; 65%) were ranked as mostly (75% to 99%) accurate, with the others (14 of 40; 35%) ranked as moderately (50% to 74%) accurate, by an attending surgeon. Neither surgeon ranked any condition as mostly inaccurate (<50% accurate). Interobserver agreement between accuracy ratings was poor (kappa = 0.03; p = 0.30). CONCLUSIONS: ChatGPT provides at least moderately accurate outputs for general inquiries of orthopaedic conditions but is lacking in the quantity of information it provides for risk factors and treatment options. Professional organizations, such as the AAOS, are the preferred source of musculoskeletal information when compared with ChatGPT. 
CLINICAL RELEVANCE: ChatGPT is an emerging technology with potential roles and limitations in patient education that are still being explored.",Sparks CA; Fasulo SM; Windsor JT; Bankauskas V; Contrada EV; Kraeutler MJ; Scillia AJ 38434792,Performance of ChatGPT on the National Korean Occupational Therapy Licensing Examination.,2024,Digital health,,,,"BACKGROUND: ChatGPT is an artificial intelligence-based large language model (LLM). ChatGPT has been widely applied in medicine, but its application in occupational therapy has been lacking. OBJECTIVE: This study examined the accuracy of ChatGPT on the National Korean Occupational Therapy Licensing Examination (NKOTLE) and investigated its potential for application in the field of occupational therapy. METHODS: ChatGPT 3.5 was used during the five years of the NKOTLE with Korean prompts. Multiple choice questions were entered manually by three dependent encoders, and scored according to the number of correct answers. RESULTS: During the most recent five years, ChatGPT did not achieve a passing score of 60% accuracy and exhibited interrater agreement of 0.6 or higher. CONCLUSION: ChatGPT could not pass the NKOTLE but demonstrated a high level of agreement between raters. Even though the potential of ChatGPT to pass the NKOTLE is currently inadequate, it performed very close to the passing level even with only Korean prompts.",Lee SA; Heo S; Park JH 38230387,Utilizing ChatGPT in Telepharmacy.,2024,Cureus,,,,"BACKGROUND: ChatGPT is an artificial intelligence-powered chatbot that has demonstrated capabilities in numerous fields, including medical and healthcare sciences. This study evaluates the potential for ChatGPT application in telepharmacy, the delivering of pharmaceutical care via means of telecommunications, through assessing its interactions, adherence to instructions, and ability to role-play as a pharmacist while handling a series of life-like scenario questions. 
METHODS: Two versions (ChatGPT 3.5 and 4.0, OpenAI) were assessed using two independent trials each. ChatGPT was instructed to act as a pharmacist and answer patient inquiries, followed by a set of 20 assessment questions. Then, ChatGPT was instructed to stop its act, provide feedback and list its sources for drug information. The responses to the assessment questions were evaluated in terms of accuracy, precision and clarity using a 4-point Likert-like scale. RESULTS: ChatGPT demonstrated the ability to follow detailed instructions, role-play as a pharmacist, and appropriately handle all questions. ChatGPT was able to understand case details, recognize generic and brand drug names, identify drug side effects, interactions, prescription requirements and precautions, and provide proper point-by-point instructions regarding administration, dosing, storage and disposal. The overall means of pooled scores were 3.425 (0.712) and 3.7 (0.61) for ChatGPT 3.5 and 4.0, respectively. The rank distribution of scores was not significantly different (P>0.05). None of the answers could be considered directly harmful or labeled as entirely or mostly incorrect, and most point deductions were due to other factors such as indecisiveness, adding immaterial information, missing certain considerations, or partial unclarity. The answers were similar in length across trials and appropriately concise. ChatGPT 4.0 showed superior performance, higher consistency, better character adherence and the ability to report various reliable information sources. However, it only allowed an input of 40 questions every three hours and provided inaccurate feedback regarding the number of assessed patients, compared to 3.5 which allowed unlimited input but was unable to provide feedback. 
CONCLUSIONS: Integrating ChatGPT in telepharmacy holds promising potential; however, several drawbacks must be overcome for it to function effectively.",Bazzari FH; Bazzari AH 37314466,Evaluation of the Artificial Intelligence Chatbot on Breast Reconstruction and Its Efficacy in Surgical Research: A Case Study.,2023,Aesthetic plastic surgery,,,,"BACKGROUND: ChatGPT is an open-source artificial intelligence (AI) chatbot that uses deep learning to produce human-like text dialog. Its potential applications in the scientific community are vast; however, its efficacy in performing comprehensive literature searches, data analysis, and report writing on aesthetic plastic surgery topics remains unknown. This study aims to evaluate both the accuracy and comprehensiveness of ChatGPT's responses to assess its suitability for use in aesthetic plastic surgery research. METHODS: Six questions were prompted to ChatGPT on post-mastectomy breast reconstruction. The first two questions focused on the current evidence and options for breast reconstruction post-mastectomy, and the remaining four questions focused specifically on autologous breast reconstruction. Using the Likert framework, the responses provided by ChatGPT were qualitatively assessed for accuracy and information content by two specialist plastic surgeons with extensive experience in the field. RESULTS: ChatGPT provided relevant, accurate information; however, it lacked depth. It could provide no more than a superficial overview in response to more esoteric questions and generated incorrect references. It created non-existent references and cited the wrong journal and date, which poses a significant challenge to maintaining academic integrity and warrants caution in its use in academia. CONCLUSION: While ChatGPT demonstrated proficiency in summarizing existing knowledge, it created fictitious references, which poses a significant concern for its use in academia and healthcare. 
Caution should be exercised in interpreting its responses in the aesthetic plastic surgical field and should only be used for such with sufficient oversight. LEVEL OF EVIDENCE IV: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Xie Y; Seth I; Rozen WM; Hunter-Smith DJ 37095384,Aesthetic Surgery Advice and Counseling from Artificial Intelligence: A Rhinoplasty Consultation with ChatGPT.,2023,Aesthetic plastic surgery,,,,"BACKGROUND: ChatGPT is an open-source artificial large language model that uses deep learning to produce human-like text dialogue. This observational study evaluated the ability of ChatGPT to provide informative and accurate responses to a set of hypothetical questions designed to simulate an initial consultation about rhinoplasty. METHODS: Nine questions were prompted to ChatGPT on rhinoplasty. The questions were sourced from a checklist published by the American Society of Plastic Surgeons, and the responses were assessed for accessibility, informativeness, and accuracy by Specialist Plastic Surgeons with extensive experience in rhinoplasty. RESULTS: ChatGPT was able to provide coherent and easily comprehensible answers to the questions posed, demonstrating its understanding of natural language in a health-specific context. The responses emphasized the importance of an individualized approach, particularly in aesthetic plastic surgery. However, the study also highlighted ChatGPT's limitations in providing more detailed or personalized advice. CONCLUSION: Overall, the results suggest that ChatGPT has the potential to provide valuable information to patients in a medical context, particularly in situations where patients may be hesitant to seek advice from medical professionals or where access to medical advice is limited. 
However, further research is needed to determine the scope and limitations of AI language models in this domain and to assess the potential benefits and risks associated with their use. LEVEL OF EVIDENCE V: Observational study under respected authorities. This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Xie Y; Seth I; Hunter-Smith DJ; Rozen WM; Ross R; Lee M 39042885,Assessing ChatGPT's Competency in Addressing Interdisciplinary Inquiries on Chatbot Uses in Sports Rehabilitation: Simulation Study.,2024,JMIR medical education,,,,"BACKGROUND: ChatGPT showcases exceptional conversational capabilities and extensive cross-disciplinary knowledge. In addition, it can perform multiple roles in a single chat session. This unique multirole-playing feature positions ChatGPT as a promising tool for exploring interdisciplinary subjects. OBJECTIVE: The aim of this study was to evaluate ChatGPT's competency in addressing interdisciplinary inquiries based on a case study exploring the opportunities and challenges of chatbot uses in sports rehabilitation. METHODS: We developed a model termed PanelGPT to assess ChatGPT's competency in addressing interdisciplinary topics through simulated panel discussions. Taking chatbot uses in sports rehabilitation as an example of an interdisciplinary topic, we prompted ChatGPT through PanelGPT to role-play a physiotherapist, psychologist, nutritionist, artificial intelligence expert, and athlete in a simulated panel discussion. During the simulation, we posed questions to the panel while ChatGPT acted as both the panelists for responses and the moderator for steering the discussion. We performed the simulation using ChatGPT-4 and evaluated the responses by referring to the literature and our human expertise. 
RESULTS: By tackling questions related to chatbot uses in sports rehabilitation with respect to patient education, physiotherapy, physiology, nutrition, and ethical considerations, responses from the ChatGPT-simulated panel discussion reasonably pointed to various benefits such as 24/7 support, personalized advice, automated tracking, and reminders. ChatGPT also correctly emphasized the importance of patient education, and identified challenges such as limited interaction modes, inaccuracies in emotion-related advice, assurance of data privacy and security, transparency in data handling, and fairness in model training. It also stressed that chatbots are to assist as a copilot, not to replace human health care professionals in the rehabilitation process. CONCLUSIONS: ChatGPT exhibits strong competency in addressing interdisciplinary inquiry by simulating multiple experts from complementary backgrounds, with significant implications in assisting medical education.",McBee JC; Han DY; Liu L; Ma L; Adjeroh DA; Xu D; Hu G 37546795,Interdisciplinary Inquiry via PanelGPT: Application to Explore Chatbot Application in Sports Rehabilitation.,2023,medRxiv : the preprint server for health sciences,,,,"BACKGROUND: ChatGPT showcases exceptional conversational capabilities and extensive cross-disciplinary knowledge. In addition, it possesses the ability to perform multiple roles within a single chat session. This unique multi-role-playing feature positions ChatGPT as a promising tool to explore interdisciplinary subjects. OBJECTIVE: The study intended to guide ChatGPT for interdisciplinary exploration through simulated panel discussions. As a proof-of-concept, we employed this method to evaluate the advantages and challenges of using chatbots in sports rehabilitation. METHODS: We proposed a model termed PanelGPT to explore ChatGPTs' knowledge graph on interdisciplinary topics through simulated panel discussions. 
Applied to ""chatbots in sports rehabilitation"", ChatGPT role-played both the moderator and panelists, which included a physiotherapist, psychologist, nutritionist, AI expert, and an athlete. We act as the audience posed questions to the panel, with ChatGPT acting as both the panelists for responses and the moderator for hosting the discussion. We performed the simulation using the ChatGPT-4 model and evaluated the responses with existing literature and human expertise. RESULTS: Each simulation mimicked a real-life panel discussion: The moderator introduced the panel and posed opening/closing questions, to which all panelists responded. The experts engaged with each other to address inquiries from the audience, primarily from their respective fields of expertise. By tackling questions related to education, physiotherapy, physiology, nutrition, and ethical consideration, the discussion highlighted benefits such as 24/7 support, personalized advice, automated tracking, and reminders. It also emphasized the importance of user education and identified challenges such as limited interaction modes, inaccuracies in emotion-related advice, assurance on data privacy and security, transparency in data handling, and fairness in model training. The panelists reached a consensus that chatbots are designed to assist, not replace, human healthcare professionals in the rehabilitation process. CONCLUSIONS: Compared to a typical conversation with ChatGPT, the multi-perspective approach of PanelGPT facilitates a comprehensive understanding of an interdisciplinary topic by integrating insights from experts with complementary knowledge. 
Beyond addressing the exemplified topic of chatbots in sports rehabilitation, the model can be adapted to tackle a wide array of interdisciplinary topics within educational, research, and healthcare settings.",McBee JC; Han DY; Liu L; Ma L; Adjeroh DA; Xu D; Hu G 39137029,"Understanding Health Care Students' Perceptions, Beliefs, and Attitudes Toward AI-Powered Language Models: Cross-Sectional Study.",2024,JMIR medical education,,,,"BACKGROUND: ChatGPT was not intended for use in health care, but it has potential benefits that depend on end-user understanding and acceptability, which is where health care students become crucial. There is still a limited amount of research in this area. OBJECTIVE: The primary aim of our study was to assess the frequency of ChatGPT use, the perceived level of knowledge, the perceived risks associated with its use, and the ethical issues, as well as attitudes toward the use of ChatGPT in the context of education in the field of health. In addition, we aimed to examine whether there were differences across groups based on demographic variables. The second part of the study aimed to assess the association between the frequency of use, the level of perceived knowledge, the level of risk perception, and the level of perception of ethics as predictive factors for participants' attitudes toward the use of ChatGPT. METHODS: A cross-sectional survey was conducted from May to June 2023 encompassing students of medicine, nursing, dentistry, nutrition, and laboratory science across the Americas. The study used descriptive analysis, chi-square tests, and ANOVA to assess statistical significance across different categories. The study used several ordinal logistic regression models to analyze the impact of predictive factors (frequency of use, perception of knowledge, perception of risk, and ethics perception scores) on attitude as the dependent variable. The models were adjusted for gender, institution type, major, and country. 
Stata was used to conduct all the analyses. RESULTS: Of 2661 health care students, 42.99% (n=1144) were unaware of ChatGPT. The median score of knowledge was ""minimal"" (median 2.00, IQR 1.00-3.00). Most respondents (median 2.61, IQR 2.11-3.11) regarded ChatGPT as neither ethical nor unethical. Most participants (median 3.89, IQR 3.44-4.34) ""somewhat agreed"" that ChatGPT (1) benefits health care settings, (2) provides trustworthy data, (3) is a helpful tool for clinical and educational medical information access, and (4) makes the work easier. In total, 70% (7/10) of people used it for homework. As the perceived knowledge of ChatGPT increased, there was a stronger tendency with regard to having a favorable attitude toward ChatGPT. Higher ethical consideration perception ratings increased the likelihood of considering ChatGPT as a source of trustworthy health care information (odds ratio [OR] 1.620, 95% CI 1.498-1.752), beneficial in medical issues (OR 1.495, 95% CI 1.452-1.539), and useful for medical literature (OR 1.494, 95% CI 1.426-1.564; P<.001 for all results). CONCLUSIONS: Over 40% of American health care students (1144/2661, 42.99%) were unaware of ChatGPT despite its extensive use in the health field. Our data revealed the positive attitudes toward ChatGPT and the desire to learn more about it. 
Medical educators must explore how chatbots may be included in undergraduate health care education programs.",Cherrez-Ojeda I; Gallardo-Bastidas JC; Robles-Velasco K; Osorio MF; Velez Leon EM; Leon Velastegui M; Pauletto P; Aguilar-Diaz FC; Squassi A; Gonzalez Eras SP; Cordero Carrasco E; Chavez Gonzalez KL; Calderon JC; Bousquet J; Bedbrook A; Faytong-Haro M 37544801,Revolutionizing Healthcare with ChatGPT: An Early Exploration of an AI Language Model's Impact on Medicine at Large and its Role in Pediatric Surgery.,2023,Journal of pediatric surgery,,,,"BACKGROUND: ChatGPT, a natural language processing model, has shown great promise in revolutionizing the field of medicine. This paper presents a comprehensive evaluation of the transformative potential of OpenAI's ChatGPT on healthcare and scientific research, with an exploration on its prospective capacity to impact the field of pediatric surgery. METHODS: Through an extensive review of the literature, we illuminate ChatGPT's applications in clinical healthcare and medical research while presenting the ethical considerations surrounding its use. RESULTS: Our review reveals the exciting work done so far evaluating the numerous potential uses of ChatGPT in clinical medicine and medical research, but it also shows that significant research and advancements in natural language processing models are still needed. CONCLUSION: ChatGPT has immense promise in transforming how we provide healthcare and how we conduct research. Currently, further robust research on the safety, effectiveness, and ethical considerations of ChatGPT is greatly needed. 
LEVEL OF STUDY: V.",Xiao D; Meyers P; Upperman JS; Robinson JR 39121303,"ChatGPT in medicine: A cross-disciplinary systematic review of ChatGPT's (artificial intelligence) role in research, clinical practice, education, and patient interaction.",2024,Medicine,,,,"BACKGROUND: ChatGPT, a powerful AI language model, has gained increasing prominence in medicine, offering potential applications in healthcare, clinical decision support, patient communication, and medical research. This systematic review aims to comprehensively assess the applications of ChatGPT in healthcare education, research, writing, patient communication, and practice while also delineating potential limitations and areas for improvement. METHOD: Our comprehensive database search retrieved relevant papers from PubMed, Medline and Scopus. After the screening process, 83 studies met the inclusion criteria. This review includes original studies comprising case reports, analytical studies, and editorials with original findings. RESULT: ChatGPT is useful for scientific research and academic writing, and assists with grammar, clarity, and coherence. This helps non-English speakers and improves accessibility by breaking down linguistic barriers. However, its limitations include probable inaccuracy and ethical issues, such as bias and plagiarism. ChatGPT streamlines workflows and offers diagnostic and educational potential in healthcare but exhibits biases and lacks emotional sensitivity. It is useful in inpatient communication, but requires up-to-date data and faces concerns about the accuracy of information and hallucinatory responses. 
CONCLUSION: Given the potential for ChatGPT to transform healthcare education, research, and practice, it is essential to approach its adoption in these areas with caution due to its inherent limitations.",Fatima A; Shafique MA; Alam K; Fadlalla Ahmed TK; Mustafa MS 39285377,"Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis.",2024,BMC medical education,,,,"BACKGROUND: ChatGPT, a recently developed artificial intelligence (AI) chatbot, has demonstrated improved performance in examinations in the medical field. However, thus far, an overall evaluation of the potential of ChatGPT models (ChatGPT-3.5 and GPT-4) in a variety of national health licensing examinations is lacking. This study aimed to provide a comprehensive assessment of the ChatGPT models' performance in national licensing examinations for medical, pharmacy, dentistry, and nursing research through a meta-analysis. METHODS: Following the PRISMA protocol, full-text articles from MEDLINE/PubMed, EMBASE, ERIC, Cochrane Library, Web of Science, and key journals were reviewed from the time of ChatGPT's introduction to February 27, 2024. Studies were eligible if they evaluated the performance of a ChatGPT model (ChatGPT-3.5 or GPT-4); related to national licensing examinations in the fields of medicine, pharmacy, dentistry, or nursing; involved multiple-choice questions; and provided data that enabled the calculation of effect size. Two reviewers independently completed data extraction, coding, and quality assessment. The JBI Critical Appraisal Tools were used to assess the quality of the selected articles. Overall effect size and 95% confidence intervals [CIs] were calculated using a random-effects model. RESULTS: A total of 23 studies were considered for this review, which evaluated the accuracy of four types of national licensing examinations. 
The selected articles were in the fields of medicine (n = 17), pharmacy (n = 3), nursing (n = 2), and dentistry (n = 1). They reported varying accuracy levels, ranging from 36 to 77% for ChatGPT-3.5 and 64.4-100% for GPT-4. The overall effect size for the percentage of accuracy was 70.1% (95% CI, 65-74.8%), which was statistically significant (p < 0.001). Subgroup analyses revealed that GPT-4 demonstrated significantly higher accuracy in providing correct responses than its earlier version, ChatGPT-3.5. Additionally, in the context of health licensing examinations, the ChatGPT models exhibited greater proficiency in the following order: pharmacy, medicine, dentistry, and nursing. However, the lack of a broader set of questions, including open-ended and scenario-based questions, and significant heterogeneity were limitations of this meta-analysis. CONCLUSIONS: This study sheds light on the accuracy of ChatGPT models in four national health licensing examinations across various countries and provides a practical basis and theoretical support for future research. Further studies are needed to explore their utilization in medical and health education by including a broader and more diverse range of questions, along with more advanced versions of AI chatbots.",Jin HK; Lee HE; Kim E 39526823,Evaluating the performance and clinical decision-making impact of ChatGPT-4 in reproductive medicine.,2025,International journal of gynaecology and obstetrics: the official organ of the International Federation of Gynaecology and Obstetrics,,,,"BACKGROUND: ChatGPT, a sophisticated language model developed by OpenAI, has the potential to offer professional and patient-friendly support. We aimed to assess the accuracy and reproducibility of ChatGPT-4 in answering questions related to knowledge, management, and support within the field of reproductive medicine. 
METHODS: ChatGPT-4 was used to respond to queries sourced from a domestic attending physician examination database, as well as to address both local and international treatment guidelines within the field of reproductive medicine. Each response generated by ChatGPT-4 was independently evaluated by a trio of experts specializing in reproductive medicine. The experts used four qualitative measures-relevance, accuracy, completeness, and understandability-to assess each response. RESULTS: We found that ChatGPT-4 demonstrated extensive knowledge in reproductive medicine, with median scores for relevance, accuracy, completeness, and comprehensibility of objective questions being 4, 3.5, 3, and 3, respectively. However, the composite accuracy rate for multiple-choice questions was 63.38%. Significant discrepancies were observed among the three experts' scores across all four measures. Expert 1 generally provided higher and more consistent scores, while Expert 3 awarded lower scores for accuracy. ChatGPT-4's responses to both domestic and international guidelines showed varying levels of understanding, with a lack of knowledge on regional guideline variations. However, it offered practical and multifaceted advice regarding next steps and adjusting to new guidelines. CONCLUSIONS: We analyzed the strengths and limitations of ChatGPT-4's responses on the management of reproductive medicine and relevant support. ChatGPT-4 might serve as a supplementary informational tool for patients and physicians to improve outcomes in the field of reproductive medicine.",Chen R; Zeng D; Li Y; Huang R; Sun D; Li T 38335017,Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study.,2024,JMIR medical education,,,,"BACKGROUND: ChatGPT, an artificial intelligence (AI) based on large-scale language models, has sparked interest in the field of health care. 
Nonetheless, the capabilities of AI in text comprehension and generation are constrained by the quality and volume of available training data for a specific language, and the performance of AI across different languages requires further investigation. While AI harbors substantial potential in medicine, it is imperative to tackle challenges such as the formulation of clinical care standards; facilitating cultural transitions in medical education and practice; and managing ethical issues including data privacy, consent, and bias. OBJECTIVE: The study aimed to evaluate ChatGPT's performance in processing Chinese Postgraduate Examination for Clinical Medicine questions, assess its clinical reasoning ability, investigate potential limitations with the Chinese language, and explore its potential as a valuable tool for medical professionals in the Chinese context. METHODS: A data set of Chinese Postgraduate Examination for Clinical Medicine questions was used to assess the effectiveness of ChatGPT's (version 3.5) medical knowledge in the Chinese language, which has a data set of 165 medical questions that were divided into three categories: (1) common questions (n=90) assessing basic medical knowledge, (2) case analysis questions (n=45) focusing on clinical decision-making through patient case evaluations, and (3) multichoice questions (n=30) requiring the selection of multiple correct answers. First of all, we assessed whether ChatGPT could meet the stringent cutoff score defined by the government agency, which requires a performance within the top 20% of candidates. Additionally, in our evaluation of ChatGPT's performance on both original and encoded medical questions, 3 primary indicators were used: accuracy, concordance (which validates the answer), and the frequency of insights. 
RESULTS: Our evaluation revealed that ChatGPT scored 153.5 out of 300 for original questions in Chinese, which signifies the minimum score set to ensure that at least 20% more candidates pass than the enrollment quota. However, ChatGPT had low accuracy in answering open-ended medical questions, with only 31.5% total accuracy. The accuracy for common questions, multichoice questions, and case analysis questions was 42%, 37%, and 17%, respectively. ChatGPT achieved a 90% concordance across all questions. Among correct responses, the concordance was 100%, significantly exceeding that of incorrect responses (n=57, 50%; P<.001). ChatGPT provided innovative insights for 80% (n=132) of all questions, with an average of 2.95 insights per accurate response. CONCLUSIONS: Although ChatGPT surpassed the passing threshold for the Chinese Postgraduate Examination for Clinical Medicine, its performance in answering open-ended medical questions was suboptimal. Nonetheless, ChatGPT exhibited high internal concordance and the ability to generate multiple insights in the Chinese language. Future research should investigate the language-based discrepancies in ChatGPT's performance within the health care context.",Yu P; Fang C; Liu X; Fu W; Ling J; Yan Z; Jiang Y; Cao Z; Wu M; Chen Z; Zhu W; Zhang Y; Abudukeremu A; Wang Y; Liu X; Wang J 37853081,Can ChatGPT be the Plastic Surgeon's New Digital Assistant? A Bibliometric Analysis and Scoping Review of ChatGPT in Plastic Surgery Literature.,2024,Aesthetic plastic surgery,,,,"BACKGROUND: ChatGPT, an artificial intelligence (AI) chatbot that uses natural language processing (NLP) to interact in a humanlike manner, has made significant contributions to various healthcare fields, including plastic surgery. However, its widespread use has raised ethical and security concerns. This study examines the presence of ChatGPT, an artificial intelligence (AI) chatbot, in the literature of plastic surgery. 
METHODS: A bibliometric analysis and scoping review of the ChatGPT plastic surgery literature were performed. PubMed was queried using the search term ""ChatGPT"" to identify all biomedical literature on ChatGPT, with only studies related to plastic, reconstructive, or aesthetic surgery topics being considered eligible for inclusion. RESULTS: The analysis included 30 out of 724 articles retrieved from PubMed, focusing on publications from December 2022 to July 2023. Four key areas of research emerged: applications in research/creation of original work, clinical application, surgical education, and ethics/commentary on previous studies. The versatility of ChatGPT in research, its potential in surgical education, and its role in enhancing patient education were explored. Ethical concerns regarding patient privacy, plagiarism, and the accuracy of information obtained from ChatGPT-generated sources were also highlighted. CONCLUSION: While ethical concerns persist, the study underscores the potential of ChatGPT in plastic surgery research and practice, emphasizing the need for careful utilization and collaboration to optimize its benefits while minimizing risks. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Liu HY; Alessandri-Bonetti M; Arellano JA; Egro FM 37352529,ChatGPT encounters multiple opportunities and challenges in neurosurgery.,2023,"International journal of surgery (London, England)",,,,"BACKGROUND: ChatGPT, powered by the GPT model and Transformer architecture, has demonstrated remarkable performance in the domains of medicine and healthcare, providing customized and informative responses. 
In our study, we investigated the potential of ChatGPT in the field of neurosurgery, focusing on its applications at the patient, neurosurgery student/resident, and neurosurgeon levels. METHOD: The authors conducted inquiries with ChatGPT from the viewpoints of patients, neurosurgery students/residents, and neurosurgeons, covering a range of topics, such as disease diagnosis, treatment options, prognosis, rehabilitation, and patient care. The authors also explored concepts related to neurosurgery, including fundamental principles and clinical aspects, as well as tools and techniques to enhance the skills of neurosurgery students/residents. Additionally, the authors examined disease-specific medical interventions and the decision-making processes involved in clinical practice. RESULTS: The authors received individual responses from ChatGPT, but they tended to be shallow and repetitive, lacking depth and personalization. Furthermore, ChatGPT may struggle to discern a patient's emotional state, hindering the establishment of rapport and the delivery of appropriate care. The language used in the medical field is influenced by technical and cultural factors, and biases in the training data can result in skewed or inaccurate responses. Additionally, ChatGPT's limitations include the inability to conduct physical examinations or interpret diagnostic images, potentially overlooking complex details and individual nuances in each patient's case. Moreover, its absence in the surgical setting limits its practical utility. CONCLUSION: Although ChatGPT is a powerful language model, it cannot substitute for the expertise and experience of trained medical professionals. It lacks the capability to perform physical examinations, make diagnoses, administer treatments, establish trust, provide emotional support, and assist in the recovery process. Moreover, the implementation of Artificial Intelligence in healthcare necessitates careful consideration of legal and ethical concerns. 
While recognizing the potential of ChatGPT, additional training with comprehensive data is necessary to fully maximize its capabilities.",Kuang YR; Zou MX; Niu HQ; Zheng BY; Zhang TL; Zheng BW 37389908,Reliability of Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument.,2023,Journal of medical Internet research,,,,"BACKGROUND: ChatGPT-4 is the latest release of a novel artificial intelligence (AI) chatbot able to answer freely formulated and complex questions. In the near future, ChatGPT could become the new standard for health care professionals and patients to access medical information. However, little is known about the quality of medical information provided by the AI. OBJECTIVE: We aimed to assess the reliability of medical information provided by ChatGPT. METHODS: Medical information provided by ChatGPT-4 on the 5 hepato-pancreatico-biliary (HPB) conditions with the highest global disease burden was measured with the Ensuring Quality Information for Patients (EQIP) tool. The EQIP tool is used to measure the quality of internet-available information and consists of 36 items that are divided into 3 subsections. In addition, 5 guideline recommendations per analyzed condition were rephrased as questions and input to ChatGPT, and agreement between the guidelines and the AI answer was measured by 2 authors independently. All queries were repeated 3 times to measure the internal consistency of ChatGPT. RESULTS: Five conditions were identified (gallstone disease, pancreatitis, liver cirrhosis, pancreatic cancer, and hepatocellular carcinoma). The median EQIP score across all conditions was 16 (IQR 14.5-18) for the total of 36 items. Divided by subsection, median scores for content, identification, and structure data were 10 (IQR 9.5-12.5), 1 (IQR 1-1), and 4 (IQR 4-5), respectively. Agreement between guideline recommendations and answers provided by ChatGPT was 60% (15/25). 
Interrater agreement as measured by the Fleiss kappa was 0.78 (P<.001), indicating substantial agreement. Internal consistency of the answers provided by ChatGPT was 100%. CONCLUSIONS: ChatGPT provides medical information of comparable quality to available static internet information. Although currently of limited quality, large language models could become the future standard for patients and health care professionals to gather medical information.",Walker HL; Ghani S; Kuemmerli C; Nebiker CA; Muller BP; Raptis DA; Staubli SM 38465158,Enhancing Postoperative Cochlear Implant Care With ChatGPT-4: A Study on Artificial Intelligence (AI)-Assisted Patient Education and Support.,2024,Cureus,,,,"BACKGROUND: Cochlear implantation is a critical surgical intervention for patients with severe hearing loss. Postoperative care is essential for successful rehabilitation, yet access to timely medical advice can be challenging, especially in remote or resource-limited settings. Integrating advanced artificial intelligence (AI) tools like Chat Generative Pre-trained Transformer (ChatGPT)-4 in post-surgical care could bridge the patient education and support gap. AIM: This study aimed to assess the effectiveness of ChatGPT-4 as a supplementary information resource for postoperative cochlear implant patients. The focus was on evaluating the AI chatbot's ability to provide accurate, clear, and relevant information, particularly in scenarios where access to healthcare professionals is limited. MATERIALS AND METHODS: Five common postoperative questions related to cochlear implant care were posed to ChatGPT-4. The AI chatbot's responses were analyzed for accuracy, response time, clarity, and relevance. The aim was to determine whether ChatGPT-4 could serve as a reliable source of information for patients in need, especially if the patients could not reach out to the hospital or the specialists at that moment. 
RESULTS: ChatGPT-4 provided responses aligned with current medical guidelines, demonstrating accuracy and relevance. The AI chatbot responded to each query within seconds, indicating its potential as a timely resource. Additionally, the responses were clear and understandable, making complex medical information accessible to non-medical audiences. These findings suggest that ChatGPT-4 could effectively supplement traditional patient education, providing valuable support in postoperative care. CONCLUSION: The study concluded that ChatGPT-4 has significant potential as a supportive tool for cochlear implant patients post surgery. While it cannot replace professional medical advice, ChatGPT-4 can provide immediate, accessible, and understandable information, which is particularly beneficial in special moments. This underscores the utility of AI in enhancing patient care and supporting cochlear implantation.",Aliyeva A; Sari E; Alaskarov E; Nasirov R 38420978,Performance of ChatGPT in Israeli Hebrew Internal Medicine National Residency Exam.,2024,The Israel Medical Association journal : IMAJ,,,,"BACKGROUND: Completing internal medicine specialty training in Israel involves passing the Israel National Internal Medicine Exam (Shlav Aleph), a challenging multiple-choice test. Chat generative pre-trained transformer (ChatGPT) 3.5, a language model, is increasingly used for exam preparation. OBJECTIVES: To assess the ability of ChatGPT 3.5 to pass the Israel National Internal Medicine Exam in Hebrew. METHODS: Using the 2023 Shlav Aleph exam questions, ChatGPT received prompts in Hebrew. Textual questions were analyzed after the appeal, comparing its answers to the official key. RESULTS: ChatGPT 3.5 correctly answered 36.6% of the 133 analyzed questions, with consistent performance across topics, except for challenges in nephrology and biostatistics.
CONCLUSIONS: While ChatGPT 3.5 has excelled in English medical exams, its performance in the Hebrew Shlav Aleph was suboptimal. Factors include limited training data in Hebrew, translation complexities, and unique language structures. Further investigation is essential for its effective adaptation to Hebrew medical exam preparation.",Ozeri DJ; Cohen A; Bacharach N; Ukashi O; Oppenheim A 39911377,Evaluation of GPT-4 concordance with north American spine society guidelines for lumbar fusion surgery.,2025,North American Spine Society journal,,,,"BACKGROUND: Concordance with evidence-based medicine (EBM) guidelines is associated with improved clinical outcomes in spine surgery. The North American Spine Society (NASS) has published coverage guidelines on indications for lumbar fusion surgery, with a recent survey demonstrating a 60% concordance rate across its members. GPT-4 is a popular deep learning model that receives knowledge training across public databases including those containing EBM guidelines. There is prior research exploring the potential utility of artificial intelligence (AI) software in adherence with spine surgery practices and guidelines, inviting opportunity to further investigate application in the setting of lumbar fusion surgery with current AI models. METHODS: Seventeen well-validated clinical vignettes with specific indications for or against lumbar fusion based on NASS criteria were obtained from a prior published research study. Each case was transcribed into a standardized prompt and entered into GPT-4 to obtain a decision whether fusion is indicated. Interquery reliability was assessed with serial identical queries utilizing the Fleiss' Kappa statistic. Majority response among serial queries was considered as the final GPT-4 decision. Queries were all entered in separate strings. The investigator entering the prompts was blinded to the NASS-concordant decisions for the cases prior to complete data collection. 
Decisions by GPT-4 and NASS guidelines were compared with Chi-square analysis. RESULTS: GPT-4 responses for 15/17 (88.2%) of the clinical vignettes were in concordance with NASS EBM lumbar fusion guidelines. There was a significant association in clinical decision-making when determining indication for spine fusion surgery between GPT-4 and NASS guidelines (χ² = 9.75; p<.01). There was substantial agreement among the sets of responses generated by GPT-4 for each clinical case (κ = 0.71; p<.001). CONCLUSIONS: There is significant concordance between GPT-4 responses and NASS EBM indications for lumbar fusion surgery. AI and deep learning models may prove to be an effective adjunct tool for clinical decision-making within modern spine surgery practices.",Khoylyan A; Salvato J; Vazquez F; Girgis M; Tang A; Chen T 40059391,The Need to Improve the Medical Subject Headings (MeSH) and the Excerpta Medica Tree (EMTREE) Thesauri to Perform Systematic Review on Oral Potentially Malignant Disorders.,2025,Journal of oral pathology & medicine : official publication of the International Association of Oral Pathologists and the American Academy of Oral Pathology,,,,"BACKGROUND: Despite recent advancements in the understanding and classification of oral potentially malignant disorders (OPMD), their terminology remains inconsistent and heterogeneous throughout the scientific literature, thus affecting evidence-based decision-making relevant for clinical management of these disorders. Updating this classification represents a necessity to improve the indexing and retrieval of OPMD publications, in particular for systematic reviews and meta-analyses. METHODS: Through a critical appraisal of the Medical Subject Headings (MeSH) and Excerpta Medica Tree (EMTREE) thesauri, we assessed gaps in the indexing for OPMD literature and propose improvements for enhanced categorisation and retrieval.
RESULTS: The present study identifies inconsistencies and limitations in the classification of these disorders across the major medical databases, which may be summarized in the following findings: a) The MeSH database lacks a dedicated subject heading for ""oral potentially malignant disorders""; b) EMTREE indexing is incomplete, with only 5 out of 11 recognised OPMD having corresponding terms; c) Incoherent controlled vocabulary mappings hinder systematic literature retrieval. CONCLUSION: To ensure accurate evidence synthesis, the authors recommend searching both PubMed and Embase for OPMD studies. Moreover, the use of Embase's PubMed query translator and Large Language Models, such as ChatGPT, may lead to retrieval biases due to indexing discrepancies, posing challenges for early-career researchers and students. We recommend introducing ""oral potentially malignant disorders"" as a standardised subject heading. Evidence-based medicine underpins clinical decision support systems, which rely on standardised clinical coding for reliable health information. Enhanced medical ontologies will facilitate structured clinical coding, ensuring interoperability and improving clinical decision support systems.",Caponio VCA; Musella G; Perez-Sayans M; Lo Muzio L; Amaral Mendes R; Lopez-Pintor RM 39941547,Performance of ChatGPT in Pediatric Audiology as Rated by Students and Experts.,2025,Journal of clinical medicine,,,,"Background: Despite the growing popularity of artificial intelligence (AI)-based systems such as ChatGPT, there is still little evidence of their effectiveness in audiology, particularly in pediatric audiology. The present study aimed to verify the performance of ChatGPT in this field, as assessed by both students and professionals, and to compare its Polish and English versions. Methods: ChatGPT was presented with 20 questions, which were posed twice, first in Polish and then in English. 
A group of 20 students and 16 professionals in the field of audiology and otolaryngology rated the answers on a Likert scale of 1 to 5 in terms of correctness, relevance, completeness, and linguistic accuracy. Both groups were also asked to assess the usefulness of ChatGPT as a source of information for patients, in educational settings for students, and in professional work. Results: Both students and professionals generally rated ChatGPT's responses to be satisfactory. For most of the questions, ChatGPT's responses were rated somewhat higher by the students than the professionals, although statistically significant differences were only evident for completeness and linguistic accuracy. Those who rated ChatGPT's responses more highly also rated its usefulness more highly. Conclusions: ChatGPT can possibly be used for quick information retrieval, especially by non-experts, but it lacks the depth and reliability required by professionals. The different ratings given by students and professionals, and its language dependency, indicate it works best as a supplementary tool, not as a replacement for verifiable sources, particularly in a healthcare setting.",Ratuszniak A; Gos E; Lorens A; Skarzynski PH; Skarzynski H; Jedrzejczak WW 37128784,Performance of ChatGPT on the Plastic Surgery Inservice Training Examination.,2023,Aesthetic surgery journal,,,,"BACKGROUND: Developed originally as a tool for resident self-evaluation, the Plastic Surgery Inservice Training Examination (PSITE) has become a standardized tool adopted by Plastic Surgery residency programs. The introduction of large language models (LLMs), such as ChatGPT (OpenAI, San Francisco, CA), has demonstrated the potential to help propel the field of Plastic Surgery. OBJECTIVES: The authors of this study wanted to assess whether or not ChatGPT could be utilized as a tool in resident education by assessing its accuracy on the PSITE. 
METHODS: Questions were obtained from the 2022 PSITE, which was present on the American Council of Academic Plastic Surgeons (ACAPS) website. Questions containing images or tables were carefully inspected and flagged before being inputted into ChatGPT. All responses by ChatGPT were qualified utilizing the properties of natural coherence. Responses that were found to be incorrect were divided into the following categories: logical, informational, or explicit fallacy. RESULTS: ChatGPT answered a total of 242 questions with an accuracy of 54.96%. The software incorporated logical reasoning in 88.8% of questions, internal information in 95.5% of questions, and external information in 92.1% of questions. When stratified by correct and incorrect responses, we determined that there was a statistically significant difference in ChatGPT's use of external information (P < .05). CONCLUSIONS: ChatGPT is a versatile tool that has the potential to impact resident education by providing general knowledge, clarifying information, providing case-based learning, and promoting evidence-based medicine. With advancements in LLM and artificial intelligence (AI), it is possible that ChatGPT may be an impactful tool for resident education within Plastic Surgery.",Gupta R; Herzog I; Park JB; Weisberger J; Firouzbakht P; Ocon V; Chao J; Lee ES; Mailey BA 39148849,Foundational model aided automatic high-throughput drug screening using self-controlled cohort study.,2024,medRxiv : the preprint server for health sciences,,,,"BACKGROUND: Developing medicine from scratch to governmental authorization and detecting adverse drug reactions (ADR) have barely been economical, expeditious, and risk-averse investments. The availability of large-scale observational healthcare databases and the popularity of large language models offer an unparalleled opportunity to enable automatic high-throughput drug screening for both repurposing and pharmacovigilance. 
OBJECTIVES: To demonstrate a general workflow for automatic high-throughput drug screening with the following advantages: (i) the associations of various exposures with diseases can be estimated; (ii) both repurposing and pharmacovigilance are integrated; (iii) accurate exposure length for each prescription is parsed from clinical texts; (iv) intrinsic relationships between drugs and diseases are removed jointly by bioinformatic mapping and a large language model - ChatGPT; (v) causal-wise interpretations for incidence rate contrasts are provided. METHODS: Using a self-controlled cohort study design where subjects serve as their own control group, we tested the intention-to-treat association between medications and the incidence of diseases. Exposure length for each prescription is determined by parsing common dosages in English free text into a structured format. The exposure period starts from initial prescription and ends at treatment discontinuation. An exposure period of the same length preceding initial treatment serves as the control period. Clinical outcomes and categories are identified using existing phenotyping algorithms. Incidence rate ratios (IRR) are tested using uniformly most powerful (UMP) unbiased tests. RESULTS: We assessed 3,444 medications and 276 diseases in 6,613,198 patients from the Clinical Practice Research Datalink (CPRD), a UK primary care electronic health record (EHR) database spanning 1987 to 2018. Due to the built-in selection bias of self-controlled cohort studies, ingredient-disease pairs confounded by deterministic medical relationships are removed using existing mappings from RxNorm and, where no mapping exists, by calling ChatGPT. A total of 16,901 drug-disease pairs revealed significant risk reduction and can be considered candidates for repurposing, while 11,089 pairs showed a significant risk increase, where drug safety might instead be a concern.
CONCLUSIONS: This work developed a data-driven, nonparametric, hypothesis-generating, and automatic high-throughput workflow, which reveals the potential of natural language processing in pharmacoepidemiology. We demonstrate the paradigm on a large observational health dataset to help discover potential novel therapies and adverse drug effects. The framework of this study can be extended to other observational medical databases.",Xu S; Cobzaru R; Finkelstein SN; Welsch RE; Ng K; Middleton L 39229463,Diagnostic performance of generative artificial intelligences for a series of complex case reports.,2024,Digital health,,,,"BACKGROUND: Diagnostic performance of generative artificial intelligences (AIs) using large language models (LLMs) across comprehensive medical specialties is still unknown. OBJECTIVE: We aimed to evaluate the diagnostic performance of generative AIs using LLMs in complex case series across comprehensive medical fields. METHODS: We analyzed published case reports from the American Journal of Case Reports from January 2022 to March 2023. We excluded pediatric cases and those primarily focused on management. We utilized three generative AIs to generate the top 10 differential-diagnosis (DDx) lists from case descriptions: the fourth-generation chat generative pre-trained transformer (ChatGPT-4), Google Gemini (previously Bard), and LLM Meta AI 2 (LLaMA2) chatbot. Two independent physicians assessed the inclusion of the final diagnosis in the lists generated by the AIs. RESULTS: Out of 557 consecutive case reports, 392 were included. The inclusion rates of the final diagnosis within top 10 DDx lists were 86.7% (340/392) for ChatGPT-4, 68.6% (269/392) for Google Gemini, and 54.6% (214/392) for LLaMA2 chatbot. The top diagnoses matched the final diagnoses in 54.6% (214/392) for ChatGPT-4, 31.4% (123/392) for Google Gemini, and 23.0% (90/392) for LLaMA2 chatbot.
ChatGPT-4 showed higher diagnostic accuracy than Google Gemini (P < 0.001) and LLaMA2 chatbot (P < 0.001). Additionally, Google Gemini outperformed LLaMA2 chatbot within the top 10 DDx lists (P < 0.001) and as the top diagnosis (P = 0.010). CONCLUSIONS: This study demonstrated the diagnostic performance of generative AIs including ChatGPT-4, Google Gemini, and LLaMA2 chatbot. ChatGPT-4 exhibited higher diagnostic accuracy than the other platforms. These findings suggest the importance of understanding the differences in diagnostic performance among generative AIs, especially in complex case series across comprehensive medical fields, like general medicine.",Hirosawa T; Harada Y; Mizuta K; Sakamoto T; Tokumasu K; Shimizu T 39094112,Patient-Representing Population's Perceptions of GPT-Generated Versus Standard Emergency Department Discharge Instructions: Randomized Blind Survey Assessment.,2024,Journal of medical Internet research,,,,"BACKGROUND: Discharge instructions are a key form of documentation and patient communication in the time of transition from the emergency department (ED) to home. Discharge instructions are time-consuming and often underprioritized, especially in the ED, leading to discharge delays and possibly impersonal patient instructions. Generative artificial intelligence and large language models (LLMs) offer promising methods of creating high-quality and personalized discharge instructions; however, there exists a gap in understanding patient perspectives of LLM-generated discharge instructions. OBJECTIVE: We aimed to assess the use of LLMs such as ChatGPT in synthesizing accurate and patient-accessible discharge instructions in the ED. METHODS: We synthesized 5 unique, fictional ED encounters to emulate real ED encounters that included a diverse set of clinician history, physical notes, and nursing notes. These were passed to GPT-4 in Azure OpenAI Service (Microsoft) to generate LLM-generated discharge instructions. 
Standard discharge instructions were also generated for each of the 5 unique ED encounters. All GPT-generated and standard discharge instructions were then formatted into standardized after-visit summary documents. These after-visit summaries containing either GPT-generated or standard discharge instructions were randomly and blindly administered to Amazon MTurk respondents representing patient populations through Amazon MTurk Survey Distribution. Discharge instructions were assessed based on metrics of interpretability of significance, understandability, and satisfaction. RESULTS: Our findings revealed that survey respondents' perspectives regarding GPT-generated and standard discharge instructions were significantly (P=.01) more favorable toward GPT-generated return precautions, and all other sections were considered noninferior to standard discharge instructions. Of the 156 survey respondents, GPT-generated discharge instructions were assigned favorable ratings, ""agree"" and ""strongly agree,"" more frequently along the metric of interpretability of significance in discharge instruction subsections regarding diagnosis, procedures, treatment, post-ED medications or any changes to medications, and return precautions. Survey respondents found GPT-generated instructions to be more understandable when rating procedures, treatment, post-ED medications or medication changes, post-ED follow-up, and return precautions. Satisfaction with GPT-generated discharge instruction subsections was the most favorable in procedures, treatment, post-ED medications or medication changes, and return precautions. Wilcoxon rank-sum test of Likert responses revealed significant differences (P=.01) in the interpretability of significant return precautions in GPT-generated discharge instructions compared to standard discharge instructions but not for other evaluation metrics and discharge instruction subsections. 
CONCLUSIONS: This study demonstrates the potential for LLMs such as ChatGPT to act as a method of augmenting current documentation workflows in the ED to reduce the documentation burden of physicians. The ability of LLMs to provide tailored instructions for patients by improving readability and making instructions more applicable to patients could improve upon the methods of communication that currently exist.",Huang T; Safranek C; Socrates V; Chartash D; Wright D; Dilip M; Sangal RB; Taylor RA 40385316,Can large language models detect drug-drug interactions leading to adverse drug reactions?,2025,Therapeutic advances in drug safety,,,,"BACKGROUND: Drug-drug interactions (DDI) are an important cause of adverse drug reactions (ADRs). Could large language models (LLMs) serve as valuable tools for pharmacovigilance specialists in detecting DDIs that lead to ADR notifications? OBJECTIVE: To compare the performance of three LLMs (ChatGPT, Gemini, and Claude) in detecting and explaining clinically significant DDIs that have led to an ADR. DESIGN: Observational cross-sectional study. METHODS: We used the French National Pharmacovigilance Database to randomly extract Individual Case Safety Reports (ICSRs) of ADRs with DDI (positive controls) and ICSRs of ADRs without DDI (negative controls) registered in 2022. Interaction cases were classified by difficulty level (level-1 DDI being the easiest and level-2 DDI being the most difficult). We give each LLM (ChatGPT, Gemini, and Claude) the same prompt and case summary. Sensitivity, specificity, and F-measure were calculated for each LLM in detecting DDIs in the case summaries. RESULTS: We assessed 82 ICSRs with DDIs and 22 ICSRs without DDIs. Among ICSRs with DDIs, 37 involved level-1 DDIs, and 45 involved level-2 DDIs. Correct responses were more frequent for level-1 DDIs than for level-2 DDIs. Regardless of difficulty level, ChatGPT detected 99% of DDI cases, and Claude and Gemini detected 95%. 
The percentage of correct answers to all DDI-related questions was 66% for ChatGPT, 68% for Claude, and 33% for Gemini. ChatGPT and Claude produced comparable results and outperformed Gemini (F-measure between 0.83 and 0.85 for ChatGPT and Claude and 0.63-0.68 for Gemini) to detect drugs involved in DDI. All exhibited low specificity (ChatGPT 0.68, Claude 0.64, and Gemini 0.36) and reported nonexistent DDIs for negative controls. CONCLUSION: LLMs can detect DDIs leading to pharmacovigilance cases, but cannot reliably exclude DDIs in cases without interactions. Pharmacologists are crucial for assessing whether a DDI is implicated in an ADR.",Sicard J; Montastruc F; Achalme C; Jonville-Bera AP; Songue P; Babin M; Soeiro T; Schiro P; de Canecaude C; Barus R 39702867,Aligning Large Language Models with Humans: A Comprehensive Survey of ChatGPT's Aptitude in Pharmacology.,2025,Drugs,,,,"BACKGROUND: Due to the lack of a comprehensive pharmacology test set, evaluating the potential and value of large language models (LLMs) in pharmacology is complex and challenging. AIMS: This study aims to provide a test set reference for assessing the application potential of both general-purpose and specialized LLMs in pharmacology. METHODS: We constructed a pharmacology test set consisting of three tasks: drug information retrieval, lead compound structure optimization, and research trend summarization and analysis. Subsequently, we compared the performance of general-purpose LLMs GPT-3.5 and GPT-4 on this test set. RESULTS: The results indicate that GPT-3.5 and GPT-4 can better understand instructions for information retrieval, scheme optimization, and trend summarization in pharmacology, showing significant potential in basic pharmacology tasks, especially in areas such as drug pharmacological properties, pharmacokinetics, mode of action, and toxicity prediction. 
These general LLMs also effectively summarize the current challenges and future trends in this field, proving their valuable resource for interdisciplinary pharmacology researchers. However, the limitations of ChatGPT become evident when handling tasks such as drug identification queries, drug interaction information retrieval, and drug structure simulation optimization. It struggles to provide accurate interaction information for individual or specific drugs and cannot optimize specific drugs. This lack of depth in knowledge integration and analysis limits its application in scientific research and clinical exploration. CONCLUSION: Therefore, exploring retrieval-augmented generation (RAG) or integrating proprietary knowledge bases and knowledge graphs into pharmacology-oriented ChatGPT systems would yield favorable results. This integration will further optimize the potential of LLMs in pharmacology.",Zhang Y; Ren S; Wang J; Lu J; Wu C; He M; Liu X; Wu R; Zhao J; Zhan C; Du D; Zhan Z; Singla RK; Shen B 39507462,Artificial intelligence generates proficient Spanish obstetrics and gynecology counseling templates.,2024,AJOG global reports,,,,"BACKGROUND: Effective patient counseling in Obstetrics and gynecology is vital. Existing language barriers between Spanish-speaking patients and English-speaking providers may negatively impact patient understanding and adherence to medical recommendations, as language discordance between provider and patient has been associated with medication noncompliance, adverse drug events, and underuse of preventative care. Artificial intelligence large language models may be a helpful adjunct to patient care by generating counseling templates in Spanish. OBJECTIVES: The primary objective was to determine if large language models can generate proficient counseling templates in Spanish on obstetric and gynecology topics. 
Secondary objectives were to (1) compare the content, quality, and comprehensiveness of generated templates between different large language models, (2) compare the proficiency ratings among the large language model generated templates, and (3) assess which generated templates had potential for integration into clinical practice. STUDY DESIGN: Cross-sectional study using free open-access large language models to generate counseling templates in Spanish on select obstetrics and gynecology topics. Native Spanish-speaking practicing obstetricians and gynecologists, who were blinded to the source large language model for each template, reviewed and subjectively scored each template on its content, quality, and comprehensiveness and considered it for integration into clinical practice. Proficiency ratings were calculated as a composite score of content, quality, and comprehensiveness. A score of >4 was considered proficient. Basic inferential statistics were performed. RESULTS: All artificial intelligence large language models generated proficient obstetrics and gynecology counseling templates in Spanish, with Google Bard generating the most proficient template (p<0.0001) and outperforming the others in comprehensiveness (P=.03), quality (P=.04), and content (P=.01). Microsoft Bing received the lowest scores in these domains. Physicians were likely to be willing to incorporate the templates into clinical practice, with no significant discrepancy in the likelihood of integration based on the source large language model (P=.45). CONCLUSIONS: Large language models have potential to generate proficient obstetrics and gynecology counseling templates in Spanish, which physicians would integrate into their clinical practice. Google Bard scored the highest across all attributes. There is an opportunity to use large language models to try to mitigate the language barriers in health care. 
Future studies should assess patient satisfaction, understanding, and adherence to clinical plans following receipt of these counseling templates.",Solmonovich RL; Kouba I; Quezada O; Rodriguez-Ayala G; Rojas V; Bonilla K; Espino K; Bracero LA 38819879,Redefining Health Care Data Interoperability: Empirical Exploration of Large Language Models in Information Exchange.,2024,Journal of medical Internet research,,,,"BACKGROUND: Efficient data exchange and health care interoperability are impeded by medical records often being in nonstandardized or unstructured natural language format. Advanced language models, such as large language models (LLMs), may help overcome current challenges in information exchange. OBJECTIVE: This study aims to evaluate the capability of LLMs in transforming and transferring health care data to support interoperability. METHODS: Using data from the Medical Information Mart for Intensive Care III and UK Biobank, the study conducted 3 experiments. Experiment 1 assessed the accuracy of transforming structured laboratory results into unstructured format. Experiment 2 explored the conversion of diagnostic codes between the coding frameworks of the ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification), and Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) using a traditional mapping table and a text-based approach facilitated by the LLM ChatGPT. Experiment 3 focused on extracting targeted information from unstructured records that included comprehensive clinical information (discharge notes). RESULTS: The text-based approach showed a high conversion accuracy in transforming laboratory results (experiment 1) and an enhanced consistency in diagnostic code conversion, particularly for frequently used diagnostic names, compared with the traditional mapping approach (experiment 2). In experiment 3, the LLM showed a positive predictive value of 87.2% in extracting generic drug names. 
CONCLUSIONS: This study highlighted the potential role of LLMs in significantly improving health care data interoperability, demonstrated by their high accuracy and efficiency in data transformation and exchange. The LLMs hold vast potential for enhancing medical data exchange without complex standardization for medical terms and data structure.",Yoon D; Han C; Kim DW; Kim S; Bae S; Ryu JA; Choi Y 40374171,Patient Triage and Guidance in Emergency Departments Using Large Language Models: Multimetric Study.,2025,Journal of medical Internet research,,,,"BACKGROUND: Emergency departments (EDs) face significant challenges due to overcrowding, prolonged waiting times, and staff shortages, leading to increased strain on health care systems. Efficient triage systems and accurate departmental guidance are critical for alleviating these pressures. Recent advancements in large language models (LLMs), such as ChatGPT, offer potential solutions for improving patient triage and outpatient department selection in emergency settings. OBJECTIVE: The study aimed to assess the accuracy, consistency, and feasibility of GPT-4-based ChatGPT models (GPT-4o and GPT-4-Turbo) for patient triage using the Modified Early Warning Score (MEWS) and evaluate GPT-4o's ability to provide accurate outpatient department guidance based on simulated patient scenarios. METHODS: A 2-phase experimental study was conducted. In the first phase, 2 ChatGPT models (GPT-4o and GPT-4-Turbo) were evaluated for MEWS-based patient triage accuracy using 1854 simulated patient scenarios. Accuracy and consistency were assessed before and after prompt engineering. In the second phase, GPT-4o was tested for outpatient department selection accuracy using 264 scenarios sourced from the Chinese Medical Case Repository. Each scenario was independently evaluated by GPT-4o thrice. Data analyses included Wilcoxon tests, Kendall correlation coefficients, and logistic regression analyses. 
RESULTS: In the first phase, ChatGPT's triage accuracy, based on MEWS, improved following prompt engineering. Interestingly, GPT-4-Turbo outperformed GPT-4o. GPT-4-Turbo achieved an accuracy of 100% compared to GPT-4o's accuracy of 96.2%, despite GPT-4o initially showing better performance prior to prompt engineering. This finding suggests that GPT-4-Turbo may be more adaptable to prompt optimization. In the second phase, GPT-4o, with superior performance on emotional responsiveness compared to GPT-4-Turbo, demonstrated an overall guidance accuracy of 92.63% (95% CI 90.34%-94.93%), with the highest accuracy in internal medicine (93.51%, 95% CI 90.85%-96.17%) and the lowest in general surgery (91.46%, 95% CI 86.50%-96.43%). CONCLUSIONS: ChatGPT demonstrated promising capability for supporting patient triage and outpatient guidance in EDs. GPT-4-Turbo showed greater adaptability to prompt engineering, whereas GPT-4o exhibited superior responsiveness and emotional interaction, which are essential for patient-facing tasks. Future studies should explore real-world implementation and address the identified limitations to enhance ChatGPT's clinical integration.",Wang C; Wang F; Li S; Ren QW; Tan X; Fu Y; Liu D; Qian G; Cao Y; Yin R; Li K 38432929,Performance of a Large Language Model on Japanese Emergency Medicine Board Certification Examinations.,2024,Journal of Nippon Medical School = Nippon Ika Daigaku zasshi,,,,"BACKGROUND: Emergency physicians need a broad range of knowledge and skills to address critical medical, traumatic, and environmental conditions. Artificial intelligence (AI), including large language models (LLMs), has potential applications in healthcare settings; however, the performance of LLMs in emergency medicine remains unclear. 
METHODS: To evaluate the reliability of information provided by ChatGPT, an LLM was given the questions set by the Japanese Association of Acute Medicine in its board certification examinations over a period of 5 years (2018-2022) and programmed to answer them twice. Statistical analysis was used to assess agreement of the two responses. RESULTS: The LLM successfully answered 465 of the 475 text-based questions, achieving an overall correct response rate of 62.3%. For questions without images, the rate of correct answers was 65.9%. For questions with images that were not explained to the LLM, the rate of correct answers was only 52.0%. The annual rates of correct answers to questions without images ranged from 56.3% to 78.8%. Accuracy was better for scenario-based questions (69.1%) than for stand-alone questions (62.1%). Agreement between the two responses was substantial (kappa = 0.70). Factual error accounted for 82% of the incorrectly answered questions. CONCLUSION: An LLM performed satisfactorily on an emergency medicine board certification examination in Japanese and without images. However, factual errors in the responses highlight the need for physician oversight when using LLMs.",Igarashi Y; Nakahara K; Norii T; Miyake N; Tagami T; Yokobori S 39137031,Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study.,2024,JMIR medical education,,,,"BACKGROUND: Evaluating the accuracy and educational utility of artificial intelligence-generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored. OBJECTIVE: This study aimed to assess the educational utility of ChatGPT-4-generated clinical vignettes and their applicability in educational settings. METHODS: Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. 
In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, which are information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians' experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases. RESULTS: Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data. The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations. 
CONCLUSIONS: ChatGPT-4-generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. This study emphasizes ChatGPT-4's value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application.",Takahashi H; Shikino K; Kondo T; Komori A; Yamada Y; Saita M; Naito T 39823287,Performance of ChatGPT-4o in the diagnostic workup of fever among returning travellers requiring hospitalization: a validation study.,2025,Journal of travel medicine,,,,"BACKGROUND: Febrile illness in returned travellers presents a diagnostic challenge in non-endemic settings. Chat generative pretrained transformer (ChatGPT) has the potential to assist in medical tasks, yet its diagnostic performance in clinical settings has rarely been evaluated. We conducted a validation assessment of ChatGPT-4o's performance in the workup of fever in returning travellers. METHODS: We retrieved the medical records of returning travellers hospitalized with fever during 2009-2024. Their clinical scenarios at time of presentation to the emergency department were prompted to ChatGPT-4o, using a detailed uniform format. The model was further prompted with four consistent questions concerning the differential diagnosis and recommended workup. To avoid training, we kept the model blinded to the final diagnosis. Our primary outcome was ChatGPT-4o's success rates in predicting the final diagnosis when requested to specify the top three differential diagnoses. Secondary outcomes were success rates when prompted to specify the single most likely diagnosis, and all necessary diagnostics. We also assessed ChatGPT-4o as a predicting tool for malaria and qualitatively evaluated its failures. 
RESULTS: ChatGPT-4o predicted the final diagnosis in 68% [95% confidence interval (CI) 59-77%], 78% (95% CI 69-85%) and 83% (95% CI 74-89%) of the 114 cases, when prompted to specify the most likely diagnosis, top three diagnoses and all possible diagnoses, respectively. ChatGPT-4o showed a sensitivity of 100% (95% CI 93-100%) and a specificity of 94% (95% CI 85-98%) for predicting malaria. The model failed to provide the final diagnosis in 18% (20/114) of cases, primarily by failing to predict globally endemic infections (16/21, 76%). CONCLUSIONS: ChatGPT-4o demonstrated high diagnostic accuracy when prompted with real-life scenarios of febrile returning travellers presenting to the emergency department, especially for malaria. Model training is expected to yield an improved performance and facilitate diagnostic decision-making in the field.",Yelin D; Shirin N; Harris I; Peretz Y; Yahav D; Schwartz E; Leshem E; Margalit I 39974103,The Clinical Value of ChatGPT for Epilepsy Presurgical Decision Making: Systematic Evaluation on Seizure Semiology Interpretation.,2025,medRxiv : the preprint server for health sciences,,,,"BACKGROUND: For patients with drug-resistant focal epilepsy (DRE), surgical resection of the epileptogenic zone (EZ) is an effective treatment to control seizures. Accurate localization of the EZ is crucial and is typically achieved through comprehensive presurgical approaches such as seizure semiology interpretation, electroencephalography (EEG), magnetic resonance imaging (MRI), and intracranial EEG (iEEG). However, interpreting seizure semiology poses challenges because it relies heavily on expert knowledge and is often based on inconsistent and incoherent descriptions, leading to variability and potential limitations in presurgical evaluation. 
To overcome these challenges, advanced technologies like large language models (LLMs)-with ChatGPT being a notable example-offer valuable tools for analyzing complex textual information, making them well-suited to interpret detailed seizure semiology descriptions and assist in accurately localizing the EZ. OBJECTIVE: This study evaluates the clinical value of ChatGPT in interpreting seizure semiology to localize EZs in presurgical assessments for patients with focal epilepsy and compares its performance with epileptologists. METHODS: Two data cohorts were compiled: a publicly sourced cohort consisting of 852 semiology-EZ pairs from 193 peer-reviewed journal publications and a private cohort of 184 semiology-EZ pairs collected from Far Eastern Memorial Hospital (FEMH) in Taiwan. ChatGPT was evaluated to predict the most likely EZ locations using two prompt methods: zero-shot prompting (ZSP) and few-shot prompting (FSP). To compare ChatGPT's performance, eight epileptologists were recruited to participate in an online survey to interpret 100 randomly selected semiology records. The responses from ChatGPT and the epileptologists were compared using three metrics: regional sensitivity (RSens), weighted sensitivity (WSens), and net positive inference rate (NPIR). RESULTS: In the publicly sourced cohort, ChatGPT demonstrated high RSens reliability, achieving 80-90% for the frontal and temporal lobes, 20-40% for the parietal lobe, occipital lobe, and insular cortex, and only 3% for the cingulate cortex. The WSens, which accounts for biased data distribution, consistently exceeded 67%, while the mean NPIR remained around 0. These evaluation results based on the private FEMH cohort are consistent with those from the publicly sourced cohort. A group t-test with 1000 bootstrap samples revealed that ChatGPT-4 significantly outperformed epileptologists in RSens for commonly represented EZs, such as the frontal and temporal lobes (p < 0.001). 
Additionally, ChatGPT-4 demonstrated superior overall performance in WSens (p < 0.001). However, no significant differences were observed between ChatGPT and the epileptologists in NPIR, highlighting comparable performance in this metric. CONCLUSIONS: ChatGPT demonstrated clinical value as a tool to assist the decision-making in the epilepsy preoperative workup. With ongoing advancements in LLMs, it is anticipated that the reliability and accuracy of LLMs will continue to improve in the future.",Luo Y; Jiao M; Fotedar N; Ding JE; Karakis I; Rao VR; Asmar M; Xian X; Aboud O; Wen Y; Lin JJ; Hung FM; Sun H; Rosenow F; Liu F 40354107,Clinical Value of ChatGPT for Epilepsy Presurgical Decision-Making: Systematic Evaluation of Seizure Semiology Interpretation.,2025,Journal of medical Internet research,,,,"BACKGROUND: For patients with drug-resistant focal epilepsy, surgical resection of the epileptogenic zone (EZ) is an effective treatment to control seizures. Accurate localization of the EZ is crucial and is typically achieved through comprehensive presurgical approaches such as seizure semiology interpretation, electroencephalography (EEG), magnetic resonance imaging (MRI), and intracranial EEG (iEEG). However, interpreting seizure semiology is challenging because it heavily relies on expert knowledge. The semiologies are often inconsistent and incoherent, leading to variability and potential limitations in presurgical evaluation. To overcome these challenges, advanced technologies like large language models (LLMs)-with ChatGPT being a notable example-offer valuable tools for analyzing complex textual information, making them well-suited to interpret detailed seizure semiology descriptions and accurately localize the EZ. OBJECTIVE: This study evaluates the clinical value of ChatGPT for interpreting seizure semiology to localize EZs in presurgical assessments for patients with focal epilepsy and compares its performance with that of epileptologists. 
METHODS: We compiled 2 data cohorts: a publicly sourced cohort of 852 semiology-EZ pairs from 193 peer-reviewed journal publications and a private cohort of 184 semiology-EZ pairs collected from Far Eastern Memorial Hospital (FEMH) in Taiwan. ChatGPT was evaluated to predict the most likely EZ locations using 2 prompt methods: zero-shot prompting (ZSP) and few-shot prompting (FSP). To compare the performance of ChatGPT, 8 epileptologists were recruited to participate in an online survey to interpret 100 randomly selected semiology records. The responses from ChatGPT and epileptologists were compared using 3 metrics: regional sensitivity (RSens), weighted sensitivity (WSens), and net positive inference rate (NPIR). RESULTS: In the publicly sourced cohort, ChatGPT demonstrated high RSens reliability, achieving 80% to 90% for the frontal and temporal lobes; 20% to 40% for the parietal lobe, occipital lobe, and insular cortex; and only 3% for the cingulate cortex. The WSens, which accounts for biased data distribution, consistently exceeded 67%, while the mean NPIR remained around 0. These evaluation results based on the private FEMH cohort are consistent with those from the publicly sourced cohort. A group t test with 1000 bootstrap samples revealed that ChatGPT-4 significantly outperformed epileptologists in RSens for the most frequently implicated EZs, such as the frontal and temporal lobes (P<.001). Additionally, ChatGPT-4 demonstrated superior overall performance in WSens (P<.001). However, no significant differences were observed between ChatGPT and the epileptologists in NPIR, highlighting comparable performance in this metric. CONCLUSIONS: ChatGPT demonstrated clinical value as a tool to assist decision-making during epilepsy preoperative workups. 
With ongoing advancements in LLMs, their reliability and accuracy are anticipated to improve.",Luo Y; Jiao M; Fotedar N; Ding JE; Karakis I; Rao VR; Asmar M; Xian X; Aboud O; Wen Y; Lin JJ; Hung FM; Sun H; Rosenow F; Liu F 39427271,Chasing sleep physicians: ChatGPT-4o on the interpretation of polysomnographic results.,2025,European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery,,,,"BACKGROUND: From a healthcare professional's perspective, the use of ChatGPT (Open AI), a large language model (LLM), offers huge potential as a practical and economic digital assistant. However, ChatGPT has not yet been evaluated for the interpretation of polysomnographic results in patients with suspected obstructive sleep apnea (OSA). AIMS/OBJECTIVES: To evaluate the agreement of polysomnographic result interpretation between ChatGPT-4o and a board-certified sleep physician and to shed light into the role of ChatGPT-4o in the field of medical decision-making in sleep medicine. MATERIAL AND METHODS: For this proof-of-concept study, 40 comprehensive patient profiles were designed, which represent a broad and typical spectrum of cases, ensuring a balanced distribution of demographics and clinical characteristics. After various prompts were tested, one prompt was used for initial diagnosis of OSA and a further for patients with positive airway pressure (PAP) therapy intolerance. Each polysomnographic result was independently evaluated by ChatGPT-4o and a board-certified sleep physician. Diagnosis and therapy suggestions were analyzed for agreement. RESULTS: ChatGPT-4o and the sleep physician showed 97% (29/30) concordance in the diagnosis of the simple cases. For the same cases the two assessment instances unveiled 100% (30/30) concordance regarding therapy suggestions. 
For cases with intolerance of treatment with positive airway pressure (PAP) ChatGPT-4o and the sleep physician revealed 70% (7/10) concordance in the diagnosis and 44% (22/50) concordance for therapy suggestions. CONCLUSION AND SIGNIFICANCE: Precise prompting improves the output of ChatGPT-4o and provides sleep physician-like polysomnographic result interpretation. Although ChatGPT shows some shortcomings in offering treatment advice, our results provide evidence for AI assisted automation and economization of polysomnographic interpretation by LLMs. Further research should explore data protection issues and demonstrate reproducibility with real patient data on a larger scale.",Seifen C; Huppertz T; Gouveris H; Bahr-Hamm K; Pordzik J; Eckrich J; Smith H; Kelsey T; Blaikie A; Matthias C; Kuhn S; Buhr CR 38185435,A review of top cardiology and cardiovascular medicine journal guidelines regarding the use of generative artificial intelligence tools in scientific writing.,2024,Current problems in cardiology,,,,"BACKGROUND: Generative Artificial Intelligence (AI) tools have experienced rapid development over the last decade and are gaining increasing popularity as assistive models in academic writing. However, the ability of AI to generate reliable and accurate research articles is a topic of debate. Major scientific journals have issued policies regarding the contribution of AI tools in scientific writing. METHODS: We conducted a review of the author and peer reviewer guidelines of the top 25 Cardiology and Cardiovascular Medicine journals as per the 2023 SCImago rankings. Data were obtained though reviewing journal websites and directly emailing the editorial office. Descriptive data regarding journal characteristics were coded on SPSS. Subgroup analyses of the journal guidelines were conducted based on the publishing company policies. 
RESULTS: Our analysis revealed that all scientific journals in our study permitted the documented use of AI in scientific writing with certain limitations as per ICMJE recommendations. We found that AI tools cannot be included in the authorship or be used for image generation, and that all authors are required to assume full responsibility of their submitted and published work. The use of generative AI tools in the peer review process is strictly prohibited. CONCLUSION: Guidelines regarding the use of generative AI in scientific writing are standardized, detailed, and unanimously followed by all journals in our study according to the recommendations set forth by international forums. It is imperative to ensure that these policies are carefully followed and updated to maintain scientific integrity.",Inam M; Sheikh S; Minhas AMK; Vaughan EM; Krittanawong C; Samad Z; Lavie CJ; Khoja A; D'Cruze M; Slipczuk L; Alarakhiya F; Naseem A; Haider AH; Virani SS 38446539,Leveraging Generative AI Tools to Support the Development of Digital Solutions in Health Care Research: Case Study.,2024,JMIR human factors,,,,"BACKGROUND: Generative artificial intelligence has the potential to revolutionize health technology product development by improving coding quality, efficiency, documentation, quality assessment and review, and troubleshooting. OBJECTIVE: This paper explores the application of a commercially available generative artificial intelligence tool (ChatGPT) to the development of a digital health behavior change intervention designed to support patient engagement in a commercial digital diabetes prevention program. METHODS: We examined the capacity, advantages, and limitations of ChatGPT to support digital product idea conceptualization, intervention content development, and the software engineering process, including software requirement generation, software design, and code production. 
In total, 11 evaluators, each with at least 10 years of experience in fields of study ranging from medicine and implementation science to computer science, participated in the output review process (ChatGPT vs human-generated output). All had familiarity or prior exposure to the original personalized automatic messaging system intervention. The evaluators rated the ChatGPT-produced outputs in terms of understandability, usability, novelty, relevance, completeness, and efficiency. RESULTS: Most metrics received positive scores. We identified that ChatGPT can (1) support developers to achieve high-quality products faster and (2) facilitate nontechnical communication and system understanding between technical and nontechnical team members around the development goal of rapid and easy-to-build computational solutions for medical technologies. CONCLUSIONS: ChatGPT can serve as a usable facilitator for researchers engaging in the software development life cycle, from product conceptualization to feature identification and user story development to code generation. TRIAL REGISTRATION: ClinicalTrials.gov NCT04049500; https://clinicaltrials.gov/ct2/show/NCT04049500.",Rodriguez DV; Lawrence K; Gonzalez J; Brandfield-Harvey B; Xu L; Tasneem S; Levine DL; Mann D 39948214,Evaluation of the Performance of Three Large Language Models in Clinical Decision Support: A Comparative Study Based on Actual Cases.,2025,Journal of medical systems,,,,"BACKGROUND: Generative large language models (LLMs) are increasingly integrated into the medical field. However, their actual efficacy in clinical decision-making remains partially unexplored. This study aimed to assess the performance of the three LLMs, ChatGPT-4, Gemini, and Med-Go, in the domain of professional medicine when confronted with actual clinical cases. METHODS: This study involved 134 clinical cases spanning nine medical disciplines. 
Each LLM was required to provide suggestions for diagnosis, diagnostic criteria, differential diagnosis, examination and treatment for every case. Responses were scored by two experts using a predefined rubric. RESULTS: In overall performance among the models, Med-Go achieved the highest median score (37.5, IQR 31.9-41.5), while Gemini recorded the lowest (33.0, IQR 25.5-36.6), showing a statistically significant difference among the three LLMs (p < 0.001). Analysis revealed that responses related to differential diagnosis were the weakest, while those pertaining to treatment recommendations were the strongest. Med-Go displayed notable performance advantages in gastroenterology, nephrology, and neurology. CONCLUSIONS: The findings show that all three LLMs achieved over 60% of the maximum possible score, indicating their potential applicability in clinical practice. However, inaccuracies that could lead to adverse decisions underscore the need for caution in their application. Med-Go's superior performance highlights the benefits of incorporating specialized medical knowledge into LLM training. It is anticipated that further development and refinement of medical LLMs will enhance their precision and safety in clinical use.",Wang X; Ye H; Zhang S; Yang M; Wang X 39396402,PresRecRF: Herbal prescription recommendation via the representation fusion of large TCM semantics and molecular knowledge.,2024,Phytomedicine : international journal of phytotherapy and phytopharmacology,,,,"BACKGROUND: Herbal prescription recommendation (HPR) is a hotspot in the research of clinical intelligent decision support. Recently, plentiful HPR models based on deep neural networks have been proposed. Owing to insufficient data, e.g., lack of molecular knowledge, TCM theory, and herbal dosage in HPR modeling, the existing models suffer from challenges, e.g., limited prediction precision, and are far from real-world clinics.
PURPOSE: To address these problems, we proposed a novel herbal prescription recommendation model with the representation fusion of large TCM semantics and molecular knowledge (termed PresRecRF). STUDY DESIGN AND METHODS: PresRecRF comprises three key modules. The representation learning module consists of two key components: a molecular knowledge representation component, integrating molecular knowledge into the herb-symptom-protein knowledge graph to enhance representations for herbs and symptoms; and a TCM knowledge representation component, leveraging BERT and ChatGPT to acquire TCM knowledge-enriched semantic representations. We introduced a representation fusion module to effectively merge molecular and TCM semantic representations. In the herb recommendation module, a multi-task objective loss is implemented to predict both herbs and dosages simultaneously. RESULTS: The experimental results on two clinical datasets show that PresRecRF achieves optimal performance. Further analysis of ablation, hyper-parameters, and case studies indicates the effectiveness and reliability of the proposed model, suggesting that it can support precision medicine and treatment recommendations. CONCLUSION: The entire process of the proposed PresRecRF model closely mirrors the actual diagnosis and treatment procedures carried out by doctors, making it well suited to real clinical scenarios. The source code of PresRecRF is available at https://github.com/2020MEAI/PresRecRF.",Yang K; Dong X; Zhang S; Yu H; Zhong L; Zhang L; Zhao H; Hou Y; Song X; Zhou X 39718328,"Artificial Intelligence, the ChatGPT Large Language Model: Assessing the Accuracy of Responses to the Gynaecological Endoscopic Surgical Education and Assessment (GESEA) Level 1-2 knowledge tests.",2024,"Facts, views & vision in ObGyn",,,,"BACKGROUND: In 2022, OpenAI launched ChatGPT 3.5, which is now widely used in medical education, training, and research.
Despite its valuable use for the generation of information, concerns persist about its authenticity and accuracy. Its undisclosed information source and outdated dataset pose risks of misinformation. Although it is widely used, AI-generated text inaccuracies raise doubts about its reliability. The ethical use of such technologies is crucial to uphold scientific accuracy in research. OBJECTIVE: This study aimed to assess the accuracy of ChatGPT in completing GESEA tests 1 and 2. MATERIALS AND METHODS: The 100 multiple-choice theoretical questions from GESEA certifications 1 and 2 were presented to ChatGPT, requesting the selection of the correct answer along with an explanation. Expert gynaecologists evaluated and graded the explanations for accuracy. MAIN OUTCOME MEASURES: ChatGPT showed a 59% accuracy in responses, with 64% providing comprehensive explanations. It performed better in GESEA Level 1 (64% accuracy) than in GESEA Level 2 (54% accuracy) questions. CONCLUSIONS: ChatGPT is a versatile tool in medicine and research, offering knowledge and information and promoting evidence-based practice. Despite its widespread use, its accuracy has not been validated yet. This study found a 59% correct response rate, highlighting the need for accuracy validation and ethical use considerations. Future research should investigate ChatGPT's truthfulness in subspecialty fields such as gynaecologic oncology and compare different versions of the chatbot for continuous improvement. WHAT IS NEW? Artificial intelligence (AI) has great potential in scientific research. However, the validity of outputs remains unverified.
This study aims to evaluate the accuracy of responses generated by ChatGPT to enhance the critical use of this tool.",Pavone M; Palmieri L; Bizzarri N; Rosati A; Campolo F; Innocenzi C; Taliento C; Restaino S; Catena U; Vizzielli G; Akladios C; Ianieri MM; Marescaux J; Campo R; Fanfani F; Scambia G 40405178,Challenging cases of hyponatremia incorrectly interpreted by ChatGPT.,2025,BMC medical education,,,,"BACKGROUND: In clinical medicine, the assessment of hyponatremia is frequently required but also known as a source of major diagnostic errors, substantial mismanagement, and iatrogenic morbidity. Because artificial intelligence techniques are efficient in analyzing complex problems, their use may possibly overcome current assessment limitations. There is no literature concerning Chat Generative Pre-trained Transformer (ChatGPT-3.5) use for evaluating difficult hyponatremia cases. Because of the interesting pathophysiology, hyponatremia cases are often used in medical education to train students in patient evaluation, and students increasingly use artificial intelligence as a diagnostic tool. To evaluate this possibility, four challenging hyponatremia cases, published previously, were presented to the free ChatGPT-3.5 for diagnosis and treatment suggestions. METHODS: We used four challenging hyponatremia cases that were evaluated by 46 physicians in Canada, the Netherlands, South-Africa, Taiwan, and USA, and published previously. These four cases were presented twice to the free ChatGPT, version 3.5, in December 2023 as well as in September 2024, with the request to recommend diagnosis and therapy. Responses by ChatGPT were compared with those of the clinicians. RESULTS: Cases 1 and 3 have a single cause of hyponatremia. Cases 2 and 4 have two contributing hyponatremia features.
Neither ChatGPT in 2023 nor the 46 clinicians in the previously published assessment recognized the most crucial cause of hyponatremia with major therapeutic consequences in all four cases. In 2024, ChatGPT properly diagnosed and suggested adequate management in one case. Concurrent Addison's disease was correctly recognized in case 1 by ChatGPT in 2023 and 2024, whereas 81% of the clinicians missed this diagnosis. No proper therapeutic recommendations were given by ChatGPT in 2023 in any of the four cases, but in one case adequate advice was given by ChatGPT in 2024. The 46 clinicians recommended inadequate therapy in 65%, 57%, 2%, and 76%, respectively, in cases 1 to 4. CONCLUSION: Our study currently does not support the use of the free version ChatGPT 3.5 in difficult hyponatremia cases, but a small improvement was observed after ten months with the same ChatGPT 3.5 version. Patients, health professionals, medical educators and students should be aware of the shortcomings of diagnosis and therapy suggestions by ChatGPT.",Berend K; Duits A; Gans ROB 40013072,Medication counseling for OTC drugs using customized ChatGPT-4: Comparison with ChatGPT-3.5 and ChatGPT-4o.,2025,Digital health,,,,"BACKGROUND: In Japan, consumers can purchase most over-the-counter (OTC) drugs without pharmacist guidance. Recently, generative artificial intelligence (AI) has become increasingly popular. Therefore, medical professionals need to consider the use of generative AI by consumers for medication counseling. We have previously reported responses in Japanese from ChatGPT-3.5 to 264 questions regarding whether each of 22 OTC drugs can be taken under 12 typical patient conditions. The proportion of responses that satisfied the criteria of 1) accuracy, 2) relevance, and 3) reliability with respect to package insert instructions was 20.8%.
In November 2023, GPTs were launched, enabling us to construct a customized ChatGPT, using natural language. In the present study, we compared performance in providing medication guidance among a newly customized GPT, the latest non-customized version ChatGPT-4o, and the previous version, ChatGPT-3.5. The aim was to determine whether the customization and version update of ChatGPT improved performance and to evaluate its potential usefulness. METHODS: We configured customized ChatGPT-4 by executing five instructions in Japanese and uploaded the text of package inserts for 22 OTC drugs as knowledge. We asked the same 264 questions as in our previous study. RESULTS: With the customized ChatGPT-4, the percentages of responses that satisfied the criteria of accuracy, relevance, and reliability were 93.2%, 100%, and 60.2%, respectively. Additionally, 56.1% of responses satisfied all three criteria, 2.7-fold higher compared with ChatGPT-3.5 and 1.3-fold higher compared with ChatGPT-4o. CONCLUSION: The performance of our customized GPT far exceeded that of ChatGPT-3.5. In particular, the proportion of appropriate responses to the questions using brand names was significantly improved. ChatGPT can be customized by providing drug package insert information and using appropriate prompt engineering, potentially offering helpful tools in clinical pharmacy.",Kiyomiya K; Aomori T; Ohtani H 39754097,Comparison of AI applications and anesthesiologist's anesthesia method choices.,2025,BMC anesthesiology,,,,"BACKGROUND: In medicine, Artificial intelligence has begun to be utilized in nearly every domain, from medical devices to the interpretation of imaging studies. There is still a need for more experience and more studies related to the comprehensive use of AI in medicine. The aim of the present study is to evaluate the ability of AI to make decisions regarding anesthesia methods and to compare the most popular AI programs from this perspective. 
METHODS: The study included orthopedic patients over 18 years of age scheduled for limb surgery within a 1-month period. Patients classified as ASA I-III who were evaluated in the anesthesia clinic during the preoperative period were included in the study. The anesthesia method preferred by the anesthesiologist during the operation and the patient's demographic data, comorbidities, medications, and surgical history were recorded. The obtained patient data were discussed as if presenting a patient scenario using the free versions of the ChatGPT, Copilot, and Gemini applications by a different anesthesiologist who did not perform the operation. RESULTS: Over the course of 1 month, a total of 72 patients were enrolled in the study. It was observed that both the anesthesia specialists and the Gemini application chose spinal anesthesia for the same patient in 68.5% of cases. This rate was higher than that of the other AI applications. For patients taking medication, it was observed that the Gemini application presented choices that were highly compatible (85.7%) with the anesthesiologists' preferences. CONCLUSION: AI cannot fully master the guidelines and exceptional and specific cases that arise in the course of medical treatment. Thus, we believe that AI can serve as a valuable assistant rather than replacing doctors.",Celik E; Turgut MA; Aydogan M; Kilinc M; Toktas I; Akelma H 39099569,The potential of ChatGPT in medicine: an example analysis of nephrology specialty exams in Poland.,2024,Clinical kidney journal,,,,"BACKGROUND: In November 2022, OpenAI released a chatbot named ChatGPT, a product capable of processing natural language to create human-like conversational dialogue. It has generated a lot of interest, including from the scientific community and the medical science community. Recent publications have shown that ChatGPT can correctly answer questions from medical exams such as the United States Medical Licensing Examination and other specialty exams.
To date, there have been no studies in which ChatGPT has been tested on specialty questions in the field of nephrology anywhere in the world. METHODS: Using the ChatGPT-3.5 and -4.0 algorithms in this comparative cross-sectional study, we analysed 1560 single-answer questions from the national specialty exam in nephrology from 2017 to 2023 that were available in the Polish Medical Examination Center's question database along with answer keys. RESULTS: Of the 1556 questions posed to ChatGPT-4.0, correct answers were obtained with an accuracy of 69.84%, compared with ChatGPT-3.5 (45.70%, P = .0001) and with the top results of medical doctors (85.73%, P = .0001). Of the 13 tests, ChatGPT-4.0 exceeded the required >/=60% pass rate in 11 tests passed, and scored higher than the average of the human exam results. CONCLUSION: ChatGPT-3.5 was not spectacularly successful in nephrology exams. The ChatGPT-4.0 algorithm was able to pass most of the analysed nephrology specialty exams. New generations of ChatGPT achieve similar results to humans. The best results of humans are better than those of ChatGPT-4.0.",Nicikowski J; Szczepanski M; Miedziaszczyk M; Kudlinski B 37190006,May Artificial Intelligence Influence Future Pediatric Research?-The Case of ChatGPT.,2023,"Children (Basel, Switzerland)",,,,"BACKGROUND: In recent months, there has been growing interest in the potential of artificial intelligence (AI) to revolutionize various aspects of medicine, including research, education, and clinical practice. ChatGPT represents a leading AI language model, with possible unpredictable effects on the quality of future medical research, including clinical decision-making, medical education, drug development, and better research outcomes. AIM AND METHODS: In this interview with ChatGPT, we explore the potential impact of AI on future pediatric research. 
Our discussion covers a range of topics, including the potential positive effects of AI, such as improved clinical decision-making, enhanced medical education, faster drug development, and better research outcomes. We also examine potential negative effects, such as bias and fairness concerns, safety and security issues, overreliance on technology, and ethical considerations. CONCLUSIONS: While AI continues to advance, it is crucial to remain vigilant about the possible risks and limitations of these technologies and to consider the implications of these technologies and their use in the medical field. The development of AI language models represents a significant advancement in the field of artificial intelligence and has the potential to revolutionize daily clinical practice in every branch of medicine, both surgical and clinical. Ethical and social implications must also be considered to ensure that these technologies are used in a responsible and beneficial manner.",Corsello A; Santangelo A 39578313,Leveraging ChatGPT for Enhanced Aesthetic Evaluations in Minimally Invasive Facial Procedures.,2025,Aesthetic plastic surgery,,,,"BACKGROUND: In recent years, the application of AI technologies like ChatGPT has gained traction in the field of plastic surgery. AI models can analyze pre- and post-treatment images to offer insights into the effectiveness of cosmetic procedures. This technological advancement enables rapid, objective evaluations that can complement traditional assessment methods, providing a more comprehensive understanding of treatment outcomes. OBJECTIVE: The study aimed to comprehensively assess the effectiveness of custom ChatGPT model, ""Face Rating and Review AI,"" in facial feature evaluation in minimally invasive aesthetic procedures, particularly before and after Botox treatments. 
METHOD: An analysis was conducted on the Web of Science (WoS) database, identifying 79 articles published between 2023 and 2024 on ChatGPT in the field of plastic surgery from various countries. A dataset of 23 patients from Kaggle, including pre- and post-Botox images, was used. The custom ChatGPT model, ""Face Rating & Review AI,"" was used to assess facial features based on objective parameters such as the golden ratio, symmetry, proportion, side angles, skin condition, and overall harmony, as well as subjective parameters like personality, temperament, and social attraction. RESULT: The WoS search found 79 articles on ChatGPT in plastic surgery from 27 countries, with most publications originating from the USA, Australia, and Italy. The objective and subjective parameters were analyzed using a paired t-test, and all facial features showed low p-values (<0.05). Higher mean scores on features such as the golden ratio (mean = 5.86, SD = 0.69), skin condition (mean = 3.78, SD = 0.73), and personality (mean = 5.0, SD = 0.79) indicate positive shifts after the treatment. CONCLUSION: The custom ChatGPT model ""Face Rating and Review AI"" is a valuable tool for assessing facial features in Botox treatments. It effectively evaluates objective and subjective attributes, aiding clinical decision-making. However, ethical considerations highlight the need for diverse datasets in future research to improve accuracy and inclusivity. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. 
For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.",Ali R; Cui H 38838389,Empowering gynaecologists with Artificial Intelligence: Tailoring surgical solutions for fibroids.,2024,"European journal of obstetrics, gynecology, and reproductive biology",,,,"BACKGROUND: In recent years, the integration of Artificial intelligence (AI) into various fields of medicine, including Gynaecology, has shown promising potential. Surgical treatment of fibroid is myomectomy if uterine preservation and fertility are the primary aims. AI usage begins with the involvement of LLM (Large Language Model) from the point when a patient visits a gynecologist, from identifying signs and symptoms to reaching a diagnosis, providing treatment plans, and patient counseling. OBJECTIVE: Use of AI (ChatGPT versus Google Bard) in the surgical management of fibroid. STUDY DESIGN: Identifying the patient's problems using LLMs like ChatGPT and Google Bard and giving a treatment option in 8 clinical scenarios of fibroid. Data entry was done using M.S. Excel and was statistically analyzed using Statistical Package for Social Sciences (SPSS Version 26) for M.S. Windows 2010. All results were presented in tabular form. Data were analyzed using nonparametric tests (Chi-square test or Fisher exact test). p values < 0.05 were considered statistically significant. The sensitivity of both techniques was calculated. Cohen's Kappa was used to assess the degree of agreement. RESULTS: We found that on the first attempt, ChatGPT gave general answers in 62.5 % of cases and specific answers in 37.5 % of cases. ChatGPT showed improved sensitivity on successive prompts, from 37.5 % to 62.5 % on the third prompt. Google Bard could not identify the clinical question in 50 % of cases and gave incorrect answers in 12.5 % of cases (p = 0.04). Google Bard showed the same sensitivity of 25 % on all prompts.
CONCLUSION: AI helps to reduce the time to diagnose and plan a treatment strategy for fibroid and acts as a powerful tool in the hands of a gynecologist. However, the usage of AI by patients for self-treatment is to be avoided and should be used only for education and counseling about fibroids.",Sinha R; Raina R; Bag M; Rupa B 38592758,Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration.,2024,JMIR medical informatics,,,,"BACKGROUND: In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear. OBJECTIVE: This study aims to assess the impact of adding image data on ChatGPT-4's diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data. METHODS: We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic, pediatric, and lacking image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnoses within differential diagnosis lists. Two independent physicians evaluated their accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis. 
RESULTS: The integration of image data into ChatGPT-4V did not significantly enhance diagnostic accuracy, showing that final diagnoses were included in the top 10 differential diagnosis lists at a rate of 85.1% (n=309), comparable to the rate of 87.9% (n=319) for the text-only version (P=.33). Notably, ChatGPT-4V's performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, chi-square test). Additionally, ChatGPT-4's self-reports showed that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of cases. CONCLUSIONS: Our findings reveal that currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine.",Hirosawa T; Harada Y; Tokumasu K; Ito T; Suzuki T; Shimizu T 38775367,Evaluation of ChatGPT as a Tool for Answering Clinical Questions in Pharmacy Practice.,2024,Journal of pharmacy practice,,,,"Background: In the healthcare field, there has been a growing interest in using artificial intelligence (AI)-powered tools to assist healthcare professionals, including pharmacists, in their daily tasks. Objectives: To provide commentary and insight into the potential for generative AI language models such as ChatGPT as a tool for answering practice-based, clinical questions and the challenges that need to be addressed before implementation in pharmacy practice settings.
Methods: To assess ChatGPT, pharmacy-based questions were prompted to ChatGPT (Version 3.5; free version) and responses were recorded. Question types included 6 drug information questions, 6 enhanced prompt drug information questions, 5 patient case questions, 5 calculations questions, and 10 drug knowledge questions (e.g., top 200 drugs). After all responses were collected, ChatGPT responses were assessed for appropriateness. Results: ChatGPT responses were generated from 32 questions in 5 categories and evaluated on a total of 44 possible points. Among all ChatGPT responses and categories, the overall score was 21 of 44 points (47.73%). ChatGPT scored higher in pharmacy calculation (100%), drug information (83%), and top 200 drugs (80%) categories and lower in drug information enhanced prompt (33%) and patient case (20%) categories. Conclusion: This study suggests that ChatGPT has limited success as a tool to answer pharmacy-based questions. ChatGPT scored higher in calculation and multiple-choice questions but scored lower in drug information and patient case questions, generating misleading or fictional answers and citations.",Munir F; Gehres A; Wai D; Song L 39719573,Assessing the accuracy and quality of artificial intelligence (AI) chatbot-generated responses in making patient-specific drug-therapy and healthcare-related decisions.,2024,BMC medical informatics and decision making,,,,"BACKGROUND: Interactive artificial intelligence tools such as ChatGPT have gained popularity, yet little is known about their reliability as a reference tool for healthcare-related information for healthcare providers and trainees. The objective of this study was to assess the consistency, quality, and accuracy of the responses generated by ChatGPT on healthcare-related inquiries. 
METHODS: A total of 18 open-ended questions including six questions in three defined clinical areas (2 each to address ""what"", ""why"", and ""how"", respectively) were submitted to ChatGPT v3.5 based on real-world usage experience. The experiment was conducted in duplicate using 2 computers. Five investigators independently ranked each response using a 4-point scale to rate the quality of the bot's responses. The Delphi method was used to compare each investigator's score with the goal of reaching at least 80% consistency. The accuracy of the responses was checked using established professional references and resources. When the responses were in question, the bot was asked to provide reference material used for the investigators to determine the accuracy and quality. The investigators determined the consistency, accuracy, and quality by establishing a consensus. RESULTS: The speech pattern and length of the responses were consistent within the same user but different between users. Occasionally, ChatGPT provided 2 completely different responses to the same question. Overall, ChatGPT provided more accurate responses (8 out of 12) to the ""what"" questions with less reliable performance to the ""why"" and ""how"" questions. We identified errors in calculation, unit of measurement, and misuse of protocols by ChatGPT. Some of these errors could result in clinical decisions leading to harm. We also identified citations and references shown by ChatGPT that did not exist in the literature. CONCLUSIONS: ChatGPT is not ready to take on the coaching role for either healthcare learners or healthcare professionals. The lack of consistency in the responses to the same question is problematic for both learners and decision-makers. The intrinsic assumptions made by the chatbot could lead to erroneous clinical decisions. 
The unreliability in providing valid references is a serious flaw in using ChatGPT to drive clinical decision making.",Shiferaw MW; Zheng T; Winter A; Mike LA; Chan LN 38401366,Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam.,2024,International journal of nursing studies,,,,"BACKGROUND: Investigates the integration of an artificial intelligence tool, specifically ChatGPT, in nursing education, addressing its effectiveness in exam preparation and self-assessment. OBJECTIVE: This study aims to evaluate the performance of ChatGPT, one of the most promising artificial intelligence-driven linguistic understanding tools in answering question banks for nursing licensing examination preparation. It further analyzes question characteristics that might impact the accuracy of ChatGPT-generated answers and examines its reliability through human expert reviews. DESIGN: Cross-sectional survey comparing ChatGPT-generated answers and their explanations. SETTING: 400 questions from Taiwan's 2022 Nursing Licensing Exam. METHODS: The study analyzed 400 questions from five distinct subjects of Taiwan's 2022 Nursing Licensing Exam using the ChatGPT model which provided answers and in-depth explanations for each question. The impact of various question characteristics, such as type and cognitive level, on the accuracy of the ChatGPT-generated responses was assessed using logistic regression analysis. Additionally, human experts evaluated the explanations for each question, comparing them with the ChatGPT-generated answers to determine consistency. RESULTS: ChatGPT exhibited overall accuracy at 80.75 % for Taiwan's National Nursing Exam, which passes the exam. 
The accuracy of ChatGPT-generated answers diverged significantly across test subjects, demonstrating a hierarchy ranging from General Medicine at 88.75 %, Medical-Surgical Nursing at 80.0 %, Psychology and Community Nursing at 70.0 %, Obstetrics and Gynecology Nursing at 67.5 %, down to Basic Nursing at 63.0 %. ChatGPT had a higher probability of eliciting incorrect responses for questions with certain characteristics, notably those with clinical vignettes [odds ratio 2.19, 95 % confidence interval 1.24-3.87, P = 0.007] and complex multiple-choice questions [odds ratio 2.37, 95 % confidence interval 1.00-5.60, P = 0.049]. Furthermore, 14.25 % of ChatGPT-generated answers were inconsistent with their explanations, leading to a reduction in the overall accuracy to 74 %. CONCLUSIONS: This study reveals the ChatGPT's capabilities and limitations in nursing exam preparation, underscoring its potential as an auxiliary educational tool. It highlights the model's varied performance across different question types and notable inconsistencies between its answers and explanations. The study contributes significantly to the understanding of artificial intelligence in learning environments, guiding the future development of more effective and reliable artificial intelligence-based educational technologies. TWEETABLE ABSTRACT: New study reveals ChatGPT's potential and challenges in nursing education: Achieves 80.75 % accuracy in exam prep but faces hurdles with complex questions and logical consistency. 
#AIinNursing #AIinEducation #NursingExams #ChatGPT.",Su MC; Lin LE; Lin LH; Chen YC 37662036,"The utility of ChatGPT in the assessment of literature on the prevention of migraine: an observational, qualitative study.",2023,Frontiers in neurology,,,,"BACKGROUND: It is not known how large language models, such as ChatGPT, can be applied toward the assessment of the efficacy of medications, including in the prevention of migraine, and how it might support those claims with existing medical evidence. METHODS: We queried ChatGPT-3.5 on the efficacy of 47 medications for the prevention of migraine and then asked it to give citations in support of its assessment. ChatGPT's evaluations were then compared to their FDA approval status for this indication as well as the American Academy of Neurology 2012 evidence-based guidelines for the prevention of migraine. The citations ChatGPT generated for these evaluations were then assessed to see if they were real papers and if they were relevant to the query. RESULTS: ChatGPT affirmed that the 14 medications that have either received FDA approval for prevention of migraine or AAN Grade A/B evidence were effective for migraine. Its assessments of the other 33 medications were unreliable including suggesting possible efficacy for four medications that have never been used for the prevention of migraine. Critically, only 33/115 (29%) of the papers ChatGPT cited were real, while 76/115 (66%) were ""hallucinated"" not real papers and 6/115 (5%) shared the names of real papers but had not real citations. 
CONCLUSION: While ChatGPT produced tailored answers on the efficacy of the queried medications, the results were unreliable and inaccurate because of the overwhelming volume of ""hallucinated"" articles it generated and cited.",Moskatel LS; Zhang N 38348835,Performance of ChatGPT as an AI-assisted decision support tool in medicine: a proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2).,2024,Acta cardiologica,,,,"BACKGROUND: It is thought that ChatGPT, an advanced language model developed by OpenAI, may in the future serve as an AI-assisted decision support tool in medicine. OBJECTIVE: To evaluate the accuracy of ChatGPT's recommendations on medical questions related to common cardiac symptoms or conditions. METHODS: We tested ChatGPT's ability to address medical questions in two ways. First, we assessed its accuracy in correctly answering cardiovascular trivia questions (n = 50), based on quizzes for medical professionals. Second, we entered 20 clinical case vignettes on the ChatGPT platform and evaluated its accuracy compared to expert opinion and clinical course. Lastly, we compared the latest research version (v3.5; 27 September 2023) with a prior version (v3.5; 30 January 2023) to evaluate improvement over time. RESULTS: We found that ChatGPT latest version correctly answered 92% of the trivia questions, with slight variation in accuracy in the domains coronary artery disease (100%), pulmonary and venous thrombotic embolism (100%), atrial fibrillation (90%), heart failure (90%) and cardiovascular risk management (80%). In the 20 case vignettes, ChatGPT's response matched in 17 (85%) of the cases with the actual advice given. Straightforward patient-to-physician questions were all answered correctly (10/10). 
In more complex cases, where physicians (general practitioners) asked other physicians (cardiologists) for assistance or decision support, ChatGPT was correct in 70% of cases, and otherwise provided incomplete, inconclusive, or inappropriate recommendations when compared with expert consultation. ChatGPT showed significant improvement over time; as the January version correctly answered 74% (vs 92%) of trivia questions (p = 0.031), and correctly answered a mere 50% of complex cases. CONCLUSIONS: Our study suggests that ChatGPT has potential as an AI-assisted decision support tool in medicine, particularly for straightforward, low-complex medical questions, but further research is needed to fully evaluate its potential.",Harskamp RE; De Clercq L 40332991,Comparing Artificial Intelligence-Generated and Clinician-Created Personalized Self-Management Guidance for Patients With Knee Osteoarthritis: Blinded Observational Study.,2025,Journal of medical Internet research,,,,"BACKGROUND: Knee osteoarthritis is a prevalent, chronic musculoskeletal disorder that impairs mobility and quality of life. Personalized patient education aims to improve self-management and adherence; yet, its delivery is often limited by time constraints, clinician workload, and the heterogeneity of patient needs. Recent advances in large language models offer potential solutions. GPT-4 (OpenAI), distinguished by its long-context reasoning and adoption in clinical artificial intelligence research, emerged as a leading candidate for personalized health communication. However, its application in generating condition-specific educational guidance remains underexplored, and concerns about misinformation, personalization limits, and ethical oversight remain. OBJECTIVE: We evaluated GPT-4's ability to generate individualized self-management guidance for patients with knee osteoarthritis in comparison with clinician-created content. 
METHODS: This 2-phase, double-blind, observational study used data from 50 patients previously enrolled in a registered randomized trial. In phase 1, 2 orthopedic clinicians each generated personalized education materials for 25 patient profiles using anonymized clinical data, including history, symptoms, and lifestyle. In phase 2, the same datasets were processed by GPT-4 using standardized prompts. All content was anonymized and evaluated by 2 independent, blinded clinical experts using validated scoring systems. Evaluation criteria included efficiency, readability (Flesch-Kincaid, Gunning Fog, Coleman-Liau, and Simple Measure of Gobbledygook), accuracy, personalization, and comprehensiveness and safety. Disagreements between reviewers were resolved through consensus or third-party adjudication. RESULTS: GPT-4 outperformed clinicians in content generation speed (530.03 vs 37.29 words per min, P<.001). Readability was better on the Flesch-Kincaid (mean 11.56, SD 1.08 vs mean 12.67, SD 0.95), Gunning Fog (mean 12.47, SD 1.36 vs mean 14.56, SD 0.93), and Simple Measure of Gobbledygook (mean 13.33, SD 1.00 vs mean 13.81, SD 0.69) indices (all P<.001), though GPT-4 scored slightly higher on the Coleman-Liau Index (mean 15.90, SD 1.03 vs mean 15.15, SD 0.91). GPT-4 also outperformed clinicians in accuracy (mean 5.31, SD 1.73 vs mean 4.76, SD 1.10; P=.05), personalization (mean 54.32, SD 6.21 vs mean 33.20, SD 5.40; P<.001), comprehensiveness (mean 51.74, SD 6.47 vs mean 35.26, SD 6.66; P<.001), and safety (median 61, IQR 58-66 vs median 50, IQR 47-55.25; P<.001). CONCLUSIONS: GPT-4 could generate personalized self-management guidance for knee osteoarthritis with greater efficiency, accuracy, personalization, comprehensiveness, and safety than clinician-generated content, as assessed using standardized, guideline-aligned evaluation frameworks. 
These findings underscore the potential of large language models to support scalable, high-quality patient education in chronic disease management. The observed lexical complexity suggests the need to refine outputs for populations with limited health literacy. As an exploratory, single-center study, these results warrant confirmation in larger, multicenter cohorts with diverse demographic profiles. Future implementation should be guided by ethical and operational safeguards, including data privacy, transparency, and the delineation of clinical responsibility. Hybrid models integrating artificial intelligence-generated content with clinician oversight may offer a pragmatic path forward.",Du K; Li A; Zuo QH; Zhang CY; Guo R; Chen P; Du WS; Li SM 37725411,Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study.,2023,JMIR medical education,,,,"BACKGROUND: Large language model (LLM)-based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLM models to that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools. OBJECTIVE: This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents in a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident. METHODS: An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was inputted into GPT-3.5 and GPT-4. 
The artificial intelligence chatbot's responses were manually reviewed to determine the selected answer, response length, response time, provision of a rationale for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots was compared against a cohort of Family Medicine residents who concurrently attempted the test. RESULTS: GPT-4 performed significantly better compared to GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 answered 62/108 (57.4%) questions correctly. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. In 86.1% (n=93) of the responses, GPT-4 provided a rationale for why other multiple-choice options were not chosen compared to the 16.7% (n=18) achieved by GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common, while arithmetic errors were the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001). CONCLUSIONS: GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification. 
Its high degree of accuracy and advanced reasoning capabilities facilitate its potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services.",Huang RS; Lu KJQ; Meaney C; Kemppainen J; Punnett A; Leung FH 38744501,Evaluating the diagnostic performance of a large language model-powered chatbot for providing immunohistochemistry recommendations in dermatopathology.,2024,Journal of cutaneous pathology,,,,"BACKGROUND: Large language model (LLM)-powered chatbots such as ChatGPT have numerous applications. However, their effectiveness in dermatopathology has not been formally evaluated. Dermatopathological cases often require immunohistochemical workup. Here, we evaluate the performance of a chatbot in providing diagnostically useful information on immunohistochemistry relating to dermatological diseases. METHODS: We queried a commonly used chatbot for the immunophenotypes of 51 cutaneous diseases, including a diverse variety of epidermal, adnexal, hematolymphoid, and soft tissue entities. We requested it to provide references for each diagnosis. All tests were repeated, compiled, quantified, and then compared with established literature standards. RESULTS: Clustering analysis demonstrated that recommendations correlated with tumor type, suggesting chatbots can supply appropriate panels. However, a significant portion of recommendations were factually incorrect (13.9%). Citations were rarely clinically useful (24.5%). Many were confabulated (27.2%). Prompt responses for cutaneous adnexal lesions tended to be less accurate while literature references were less useful. Reference retrieval performance was associated with the number of PubMed entries per entity. CONCLUSIONS: This foundational study suggests that LLM-powered chatbots may be useful for generating immunohistochemical panels for dermatologic diagnoses. 
However, specific performance capabilities and biases must be considered. In addition, extreme caution is advised regarding the tendencies to fabricate material. Future models intentionally fine-tuned to augment diagnostic medicine may prove to be valuable.",McCrary MR; Galambus J; Chen WS 38717811,ChatGPT as a Tool for Medical Education and Clinical Decision-Making on the Wards: Case Study.,2024,JMIR formative research,,,,"BACKGROUND: Large language models (LLMs) are computational artificial intelligence systems with advanced natural language processing capabilities that have recently been popularized among health care students and educators due to their ability to provide real-time access to a vast amount of medical knowledge. The adoption of LLM technology into medical education and training has varied, and little empirical evidence exists to support its use in clinical teaching environments. OBJECTIVE: The aim of the study is to identify and qualitatively evaluate potential use cases and limitations of LLM technology for real-time ward-based educational contexts. METHODS: A brief, single-site exploratory evaluation of the publicly available ChatGPT-3.5 (OpenAI) was conducted by implementing the tool into the daily attending rounds of a general internal medicine inpatient service at a large urban academic medical center. ChatGPT was integrated into rounds via both structured and organic use, using the web-based ""chatbot"" style interface to interact with the LLM through conversational free-text and discrete queries. A qualitative approach using phenomenological inquiry was used to identify key insights related to the use of ChatGPT through analysis of ChatGPT conversation logs and associated shorthand notes from the clinical sessions. 
RESULTS: Identified use cases for ChatGPT integration included addressing medical knowledge gaps through discrete medical knowledge inquiries, building differential diagnoses and engaging dual-process thinking, challenging medical axioms, using cognitive aids to support acute care decision-making, and improving complex care management by facilitating conversations with subspecialties. Potential additional uses included engaging in difficult conversations with patients, exploring ethical challenges and general medical ethics teaching, personal continuing medical education resources, developing ward-based teaching tools, supporting and automating clinical documentation, and supporting productivity and task management. LLM biases, misinformation, ethics, and health equity were identified as areas of concern and potential limitations to clinical and training use. A code of conduct on ethical and appropriate use was also developed to guide team usage on the wards. CONCLUSIONS: Overall, ChatGPT offers a novel tool to enhance ward-based learning through rapid information querying, second-order content exploration, and engaged team discussion regarding generated responses. More research is needed to fully understand contexts for educational use, particularly regarding the risks and limitations of the tool in clinical settings and its impacts on trainee development.",Skryd A; Lawrence K 39729356,Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis.,2024,Journal of medical Internet research,,,,"BACKGROUND: Large language models (LLMs) are increasingly integrated into medical education, with transformative potential for learning and assessment. However, their performance across diverse medical exams globally has remained underexplored. OBJECTIVE: This study aims to introduce MedExamLLM, a comprehensive platform designed to systematically evaluate the performance of LLMs on medical exams worldwide. 
Specifically, the platform seeks to (1) compile and curate performance data for diverse LLMs on worldwide medical exams; (2) analyze trends and disparities in LLM capabilities across geographic regions, languages, and contexts; and (3) provide a resource for researchers, educators, and developers to explore and advance the integration of artificial intelligence in medical education. METHODS: A systematic search was conducted on April 25, 2024, in the PubMed database to identify relevant publications. Inclusion criteria encompassed peer-reviewed, English-language, original research articles that evaluated at least one LLM on medical exams. Exclusion criteria included review articles, non-English publications, preprints, and studies without relevant data on LLM performance. The screening process for candidate publications was independently conducted by 2 researchers to ensure accuracy and reliability. Data, including exam information, data process information, model performance, data availability, and references, were manually curated, standardized, and organized. These curated data were integrated into the MedExamLLM platform, enabling its functionality to visualize and analyze LLM performance across geographic, linguistic, and exam characteristics. The web platform was developed with a focus on accessibility, interactivity, and scalability to support continuous data updates and user engagement. RESULTS: A total of 193 articles were included for final analysis. MedExamLLM comprised information for 16 LLMs on 198 medical exams conducted in 28 countries across 15 languages from the year 2009 to the year 2023. The United States accounted for the highest number of medical exams and related publications, with English being the dominant language used in these exams. The Generative Pretrained Transformer (GPT) series models, especially GPT-4, demonstrated superior performance, achieving pass rates significantly higher than other LLMs. 
The analysis revealed significant variability in the capabilities of LLMs across different geographic and linguistic contexts. CONCLUSIONS: MedExamLLM is an open-source, freely accessible, and publicly available online platform providing comprehensive performance evaluation information and evidence knowledge about LLMs on medical exams around the world. The MedExamLLM platform serves as a valuable resource for educators, researchers, and developers in the fields of clinical medicine and artificial intelligence. By synthesizing evidence on LLM capabilities, the platform provides valuable insights to support the integration of artificial intelligence into medical education. Limitations include potential biases in the data source and the exclusion of non-English literature. Future research should address these gaps and explore methods to enhance LLM performance in diverse contexts.",Zong H; Wu R; Cha J; Wang J; Wu E; Li J; Zhou Y; Zhang C; Feng W; Shen B 39546795,Examining the Role of Large Language Models in Orthopedics: Systematic Review.,2024,Journal of medical Internet research,,,,"BACKGROUND: Large language models (LLMs) can understand natural language and generate corresponding text, images, and even videos based on prompts, which holds great potential in medical scenarios. Orthopedics is a significant branch of medicine, and orthopedic diseases contribute to a significant socioeconomic burden, which could be alleviated by the application of LLMs. Several pioneers in orthopedics have conducted research on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies, and a systematic summary of existing research is absent. OBJECTIVE: The objective of this review was to comprehensively summarize research findings on the application of LLMs in the field of orthopedics and explore the potential opportunities and challenges. 
METHODS: PubMed, Embase, and Cochrane Library databases were searched from January 1, 2014, to February 22, 2024, with the language limited to English. The terms, which included variants of ""large language model,"" ""generative artificial intelligence,"" ""ChatGPT,"" and ""orthopaedics,"" were divided into 2 categories: large language model and orthopedics. After completing the search, the study selection process was conducted according to the inclusion and exclusion criteria. The quality of the included studies was assessed using the revised Cochrane risk-of-bias tool for randomized trials and CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) guidance. Data extraction and synthesis were conducted after the quality assessment. RESULTS: A total of 68 studies were selected. The application of LLMs in orthopedics involved the fields of clinical practice, education, research, and management. Of these 68 studies, 47 (69%) focused on clinical practice, 12 (18%) addressed orthopedic education, 8 (12%) were related to scientific research, and 1 (1%) pertained to the field of management. Of the 68 studies, only 8 (12%) recruited patients, and only 1 (1%) was a high-quality randomized controlled trial. ChatGPT was the most commonly mentioned LLM tool. There was considerable heterogeneity in the definition, measurement, and evaluation of the LLMs' performance across the different studies. For diagnostic tasks alone, the accuracy ranged from 55% to 93%. When performing disease classification tasks, ChatGPT with GPT-4's accuracy ranged from 2% to 100%. With regard to answering questions in orthopedic examinations, the scores ranged from 45% to 73.6% due to differences in models and test selections. CONCLUSIONS: LLMs cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. 
More high-quality clinical trials are needed in the future, aiming to identify optimal applications of LLMs and advance orthopedics toward higher efficiency and precision.",Zhang C; Liu S; Zhou X; Zhou S; Tian Y; Wang S; Xu N; Li W 38952020,Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation.,2024,JMIR medical informatics,,,,"BACKGROUND: Large language models (LLMs) have achieved great progress in natural language processing tasks and demonstrated the potential for use in clinical applications. Despite their capabilities, LLMs in the medical domain are prone to generating hallucinations (not fully reliable responses). Hallucinations in LLMs' responses create substantial risks, potentially threatening patients' physical safety. Thus, to perceive and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and build a systematic evaluation. OBJECTIVE: We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical data sets in Chinese, and publicly available benchmarks. METHODS: First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, existing candidate criteria were optimized by using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets to interact with LLMs. Finally, benchmarking experiments were conducted on the data sets. The responses generated by chatbots based on LLMs were recorded for blind evaluations by 5 licensed medical experts. The evaluation criteria that were obtained covered medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets include 27 medical dialogues and 7 case reports in Chinese. 
Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory. RESULTS: Dr PJ outperformed ChatGPT and ERNIE Bot in the multiple-turn medical dialogues and case report scenarios. Dr PJ also outperformed ChatGPT in the semantic consistency rate and complete error rate category, indicating better robustness. However, Dr PJ had slightly lower scores in medical professional capabilities compared with ChatGPT in the multiple-turn dialogue scenario. CONCLUSIONS: MedGPTEval provides comprehensive criteria to evaluate chatbots by LLMs in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Therefore, such an assessment system can be easily adopted by researchers in this community to augment an open-source data set.",Xu J; Lu L; Peng X; Pang J; Ding J; Yang L; Song H; Li K; Sun X; Zhang S 38875696,"Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.",2024,Journal of medical Internet research,,,,"BACKGROUND: Large language models (LLMs) have demonstrated impressive performances in various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and ChatGPT, an LLM-based chatbot, compared to professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage. OBJECTIVE: This study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared to personnel of varying training status and to investigate if the models' responses can enhance the triage proficiency of untrained personnel. 
METHODS: A total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters' MTS level assignments, measured via quadratic-weighted Cohen kappa. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7b. RESULTS: GPT-4-based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (kappa=mean 0.67, SD 0.037 and kappa=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5-based ChatGPT (kappa=mean 0.54, SD 0.024; P<.001). When untrained doctors used this LLM for second-opinion triage, there was a slight but statistically insignificant performance increase (kappa=mean 0.70, SD 0.047; P=.97). Other tested LLMs performed similar to or worse than GPT-4-based ChatGPT or showed odd triaging behavior with the used parameters. LLMs and ChatGPT models tended toward overtriage, whereas untrained doctors undertriaged. CONCLUSIONS: While LLMs and the LLM-based product ChatGPT do not yet match professionally trained raters, their best models' triage proficiency equals that of untrained ED doctors. In its current form, LLMs or ChatGPT thus did not demonstrate gold-standard performance in ED triage and, in the setting of this study, failed to significantly improve untrained doctors' triage when used as decision support. 
Notable performance enhancements in newer LLM versions over older ones hint at future improvements with further technological development and specific training.",Masanneck L; Schmidt L; Seifert A; Kolsche T; Huntemann N; Jansen R; Mehsin M; Bernhard M; Meuth SG; Bohm L; Pawlitzki M 37665620,"Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany.",2023,JMIR medical education,,,,"BACKGROUND: Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English, and in direct comparison with medical students. Analyzing the performance of LLMs in state medical examinations can provide insights into their capabilities and limitations and evaluate their potential role in medical education and examination preparation. OBJECTIVE: This study aimed to assess and compare the performance of 3 LLMs, GPT-4, Bing, and GPT-3.5-Turbo, in the German Medical State Examinations of 2022 and to evaluate their performance relative to that of medical students. METHODS: The LLMs were assessed on a total of 630 questions from the spring and fall German Medical State Examinations of 2022. The performance was evaluated with and without media-related questions. Statistical analyses included 1-way ANOVA and independent samples t tests for pairwise comparisons. The relative strength of the LLMs in comparison with that of the students was also evaluated. RESULTS: GPT-4 achieved the highest overall performance, correctly answering 88.1% of questions, closely followed by Bing (86.0%) and GPT-3.5-Turbo (65.7%). The students had an average correct answer rate of 74.6%. Both GPT-4 and Bing significantly outperformed the students in both examinations. 
When media questions were excluded, Bing achieved the highest performance of 90.7%, closely followed by GPT-4 (90.4%), while GPT-3.5-Turbo lagged (68.2%). There was a significant decline in the performance of GPT-4 and Bing in the fall 2022 examination, which was attributed to a higher proportion of media-related questions and a potential increase in question difficulty. CONCLUSIONS: LLMs, particularly GPT-4 and Bing, demonstrate potential as valuable tools in medical education and for pretesting examination questions. Their high performance, even relative to that of medical students, indicates promising avenues for further development and integration into the educational and clinical landscape.",Roos J; Kasapovic A; Jansen T; Kaczmarczyk R 40229614,"Evaluating the Efficacy of Large Language Models in Generating Medical Documentation: A Comparative Study of ChatGPT-4, ChatGPT-4o, and Claude.",2025,Aesthetic plastic surgery,,,,"BACKGROUND: Large language models (LLMs) have demonstrated transformative potential in health care. They can enhance clinical and academic medicine by facilitating accurate diagnoses, interpreting laboratory results, and automating documentation processes. This study evaluates the efficacy of LLMs in generating surgical operation reports and discharge summaries, focusing on accuracy, efficiency, and quality. METHODS: This study assessed the effectiveness of three leading LLMs-ChatGPT-4.0, ChatGPT-4o, and Claude-using six prompts and analyzing their responses for readability and output quality, validated by plastic surgeons. Readability was measured with the Flesch-Kincaid, Flesch reading ease scores, and Coleman-Liau Index, while reliability was evaluated using the DISCERN score. A paired two-tailed t-test (p<0.05) compared the statistical significance of these metrics and the time taken to generate operation reports and discharge summaries against the authors' results. 
RESULTS: Table 3 shows statistically significant differences in readability between ChatGPT-4o and Claude across all metrics, while ChatGPT-4 and Claude differ significantly in the Flesch reading ease and Coleman-Liau indices. Table 6 reveals extremely low p-values across BL, IS, and MM for all models, with Claude consistently outperforming both ChatGPT-4 and ChatGPT-4o. Additionally, Claude generated documents the fastest, completing tasks in approximately 10 to 14 s. These results suggest that Claude not only excels in readability but also demonstrates superior reliability and speed, making it an efficient choice for practical applications. CONCLUSION: The study highlights the importance of selecting appropriate LLMs for clinical use. Integrating these LLMs can streamline healthcare documentation, improve efficiency, and enhance patient outcomes through clearer communication and more accurate medical reports. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Lim B; Seth I; Maxwell M; Cuomo R; Ross RJ; Rozen WM 40305085,Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.,2025,Journal of medical Internet research,,,,"BACKGROUND: Large language models (LLMs) have flourished and gradually become an important research and application direction in the medical field. However, due to the high degree of specialization, complexity, and specificity of medicine, which results in extremely high accuracy requirements, controversy remains about whether LLMs can be used in the medical field. More studies have evaluated the performance of various types of LLMs in medicine, but the conclusions are inconsistent. 
OBJECTIVE: This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions to provide high-level evidence for their future development and application in the medical field. METHODS: In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Studies on the accuracy of LLMs when answering clinical research questions were included and screened by reading published reports. The systematic review and NMA were conducted to compare the accuracy of different LLMs when answering clinical research questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The NMA was performed using Bayesian frequency theory methods. Indirect intercomparisons between programs were performed using a grading scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher ranking of the corresponding LLM accuracy. RESULTS: The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3063 clinical cases. Of the 168 studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) had a moderate risk, and none were rated as having a high risk. ChatGPT-4o (SUCRA=0.9207) demonstrated strong performance in terms of accuracy for objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at answering open-ended questions. In terms of accuracy for top 1 diagnosis and top 3 diagnosis of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked the highest, while Claude 3 Opus (SUCRA=0.9672) performed well at the top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest-rated SUCRA value for accuracy in the area of triage and classification. 
CONCLUSIONS: Our study indicates that ChatGPT-4o has an advantage when answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Humans are more accurate at the top 1 diagnosis and top 3 diagnosis. Claude 3 Opus performs better at the top 5 diagnosis, while for triage and classification, Gemini is more advantageous. This analysis offers valuable insights for clinicians and medical practitioners, empowering them to effectively leverage LLMs for improved decision-making in learning, diagnosis, and management of various clinical scenarios. TRIAL REGISTRATION: PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245.",Wang L; Li J; Zhuang B; Huang S; Fang M; Wang C; Li W; Zhang M; Gong S 38343631,Almanac - Retrieval-Augmented Language Models for Clinical Medicine.,2024,NEJM AI,,,,"BACKGROUND: Large language models (LLMs) have recently shown impressive zero-shot capabilities, whereby they can use auxiliary data, without the availability of task-specific training examples, to complete a variety of natural language tasks, such as summarization, dialogue generation, and question answering. However, despite many promising applications of LLMs in clinical medicine, adoption of these models has been limited by their tendency to generate incorrect and sometimes even harmful statements. METHODS: We tasked a panel of eight board-certified clinicians and two health care practitioners with evaluating Almanac, an LLM framework augmented with retrieval capabilities from curated medical resources for medical guideline and treatment recommendations. The panel compared responses from Almanac and standard LLMs (ChatGPT-4, Bing, and Bard) versus a novel data set of 314 clinical questions spanning nine medical specialties. RESULTS: Almanac showed a significant improvement in performance compared with the standard LLMs across axes of factuality, completeness, user preference, and adversarial safety. 
CONCLUSIONS: Our results show the potential for LLMs with access to domain-specific corpora to be effective in clinical decision-making. The findings also underscore the importance of carefully testing LLMs before deployment to mitigate their shortcomings. (Funded by the National Institutes of Health, National Heart, Lung, and Blood Institute.).",Zakka C; Shad R; Chaurasia A; Dalal AR; Kim JL; Moor M; Fong R; Phillips C; Alexander K; Ashley E; Boyd J; Boyd K; Hirsch K; Langlotz C; Lee R; Melia J; Nelson J; Sallam K; Tullis S; Vogelsong MA; Cunningham JP; Hiesinger W 40289855,Evaluation of Six Large Language Models for Clinical Decision Support: Application in Transfusion Decision-making for RhD Blood-type Patients.,2025,Annals of laboratory medicine,,,,"BACKGROUND: Large language models (LLMs) have the potential for clinical decision support; however, their use in specific tasks, such as determining the RhD blood type for transfusion, remains underexplored. Therefore, we evaluated the accuracy of six LLMs in addressing RhD blood type-related issues in Korean healthcare. METHODS: Fifteen multiple-choice and true/false questions, based on real-world transfusion scenarios and reviewed by specialists, were developed. The questions were administered twice to six LLMs (Clova X, Gemini 1.0, Gemini 1.5, ChatGPT-3.5, GPT-4.0, and GPT-4o) in both Korean and English. Results were compared against the performance of 22 transfusion medicine experts. For particularly challenging questions, prompt engineering was applied, and the questions were reevaluated. RESULTS: GPT-4o demonstrated the highest accuracy rate in Korean (0.6), with significant differences compared with those of Clova X and Gemini (P <0.05). In English, the results were similar across all models. The transfusion experts achieved a higher accuracy rate (0.8). Among the five questions subjected to prompt engineering, only GPT-4o correctly responded to one, whereas the other models failed. 
All LLMs changed their responses or did not respond when the same question was repeated. CONCLUSIONS: GPT-4o showed the best overall performance among the models tested and may be beneficial in RhD blood product transfusion decision-making. However, its performance suggests that it may serve best in a supportive role rather than as a primary decision-making tool.",Lee JK; Choi S; Park S; Hwang SH; Cho D 40217905,Thyro-GenAI: A Chatbot Using Retrieval-Augmented Generative Models for Personalized Thyroid Disease Management.,2025,Journal of clinical medicine,,,,"Background: Large language models (LLMs) have the potential to enhance information processing and clinical reasoning in the healthcare industry but are hindered by inaccuracies and hallucinations. The retrieval-augmented generation (RAG) technique may address these problems by integrating external knowledge sources. Methods: We developed a RAG-based chatbot called Thyro-GenAI by integrating a database of textbooks and guidelines with an LLM. Thyro-GenAI and three service LLMs (OpenAI's ChatGPT-4o, Perplexity AI's ChatGPT-4o, and Anthropic's Claude 3.5 Sonnet) were asked personalized clinical questions about thyroid disease. Three thyroid specialists assessed the quality of the generated responses and references without being blinded, which allowed them to interact with different chatbot interfaces. Results: Thyro-GenAI achieved the highest inverse-weighted mean rank for overall response quality. The overall inverse-weighted mean rankings for Thyro-GenAI, ChatGPT, Perplexity, and Claude were 3.0, 2.3, 2.8, and 1.9, respectively. Thyro-GenAI also achieved the second-highest inverse-weighted mean rank for overall reference quality. The overall inverse-weighted mean rankings for Thyro-GenAI, ChatGPT, Perplexity, and Claude were 3.1, 2.3, 3.2, and 1.8, respectively. 
Conclusions: Thyro-GenAI produced patient-specific clinical reasoning output based on a vector database, with fewer hallucinations and more reliability, compared to service LLMs. This emphasis on evidence-based responses ensures its safety and validity, addressing a critical limitation of existing LLMs. By integrating RAG with LLMs, it has the potential to support frontline clinical decision-making, especially helping first-line physicians by offering reliable decision support while managing thyroid disease patients.",Shin M; Song J; Kim MG; Yu HW; Choe EK; Chai YJ 40295957,Utilizing Large language models to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation.,2025,BMC medical research methodology,,,,"BACKGROUND: Large language models (LLMs) like ChatGPT showed great potential in aiding medical research. A heavy workload in filtering records is needed during the research process of evidence-based medicine, especially meta-analysis. However, few studies tried to use LLMs to help screen records in meta-analysis. OBJECTIVE: In this research, we aimed to explore the possibility of incorporating multiple LLMs to facilitate the screening step based on the title and abstract of records during meta-analysis. METHODS: Various LLMs were evaluated, which includes GPT-3.5, GPT-4, Deepseek-R1-Distill, Qwen-2.5, Phi-4, Llama-3.1, Gemma-2 and Claude-2. To assess our strategy, we selected three meta-analyses from the literature, together with a glioma meta-analysis embedded in the study, as additional validation. For the automatic selection of records from curated meta-analyses, a four-step strategy called LARS-GPT was developed, consisting of (1) criteria selection and single-prompt (prompt with one criterion) creation, (2) best combination identification, (3) combined-prompt (prompt with one or more criteria) creation, and (4) request sending and answer summary. 
Recall, workload reduction, precision, and F1 score were calculated to assess the performance of LARS-GPT. RESULTS: A variable performance was found between different single-prompts, with a mean recall of 0.800. Based on these single-prompts, we were able to find combinations with better performance than the pre-set threshold. Finally, with a best combination of criteria identified, LARS-GPT showed a 40.1% workload reduction on average with a recall greater than 0.9. CONCLUSIONS: We show here the groundbreaking finding that automatic selection of literature for meta-analysis is possible with LLMs. We provide it here as a pipeline, LARS-GPT, which showed a great workload reduction while maintaining a pre-set recall.",Cai X; Geng Y; Du Y; Westerman B; Wang D; Ma C; Vallejo JJG 40072530,[Integration of large language models into the clinic: Revolution in analysing and processing patient data to increase efficiency and quality in radiology].,2025,"Radiologie (Heidelberg, Germany)",,,,"BACKGROUND: Large Language Models (LLMs) like ChatGPT, Llama and Claude are transforming healthcare by interpreting complex text, extracting information, and providing guideline-based support. Radiology, with its high patient volume and digital workflows, is an ideal field for LLM integration. OBJECTIVE: Assessment of the potential of LLMs to enhance efficiency, standardization, and decision support in radiology, while addressing ethical and regulatory challenges. MATERIAL AND METHODS: Pilot studies at Freiburg and Basel university hospitals evaluated local LLM systems for tasks like prior report summarization and guideline-driven reporting. Integration with Picture Archiving and Communication System (PACS) and Electronic Health Record (EHR) systems was achieved via Digital Imaging and Communications in Medicine (DICOM) and Fast Healthcare Interoperability Resources (FHIR) standards. 
Metrics included time savings, compliance with the European Union (EU) Artificial Intelligence (AI) Act, and user acceptance. RESULTS: LLMs demonstrate significant potential as a support tool for radiologists in clinical practice by reducing reporting times, automating routine tasks, and ensuring consistent, high-quality results. They also support interdisciplinary workflows (e.g., tumor boards) and meet data protection requirements when locally implemented. DISCUSSION: Local LLM systems are feasible and beneficial in radiology, enhancing efficiency and diagnostic quality. Future work should refine transparency, expand applications, and ensure LLMs complement medical expertise while adhering to ethical and legal standards.",Arnold P; Henkel M; Bamberg F; Kotter E 40055694,A systematic review of large language model (LLM) evaluations in clinical medicine.,2025,BMC medical informatics and decision making,,,,"BACKGROUND: Large Language Models (LLMs), advanced AI tools based on transformer architectures, demonstrate significant potential in clinical medicine by enhancing decision support, diagnostics, and medical education. However, their integration into clinical workflows requires rigorous evaluation to ensure reliability, safety, and ethical alignment. OBJECTIVE: This systematic review examines the evaluation parameters and methodologies applied to LLMs in clinical medicine, highlighting their capabilities, limitations, and application trends. METHODS: A comprehensive review of the literature was conducted across PubMed, Scopus, Web of Science, IEEE Xplore, and arXiv databases, encompassing both peer-reviewed and preprint studies. Studies were screened against predefined inclusion and exclusion criteria to identify original research evaluating LLM performance in medical contexts. RESULTS: The results reveal a growing interest in leveraging LLM tools in clinical settings, with 761 studies meeting the inclusion criteria. 
While general-domain LLMs, particularly ChatGPT and GPT-4, dominated evaluations (93.55%), medical-domain LLMs accounted for only 6.45%. Accuracy emerged as the most commonly assessed parameter (21.78%). Despite these advancements, the evidence base highlights certain limitations and biases across the included studies, emphasizing the need for careful interpretation and robust evaluation frameworks. CONCLUSIONS: The exponential growth in LLM research underscores their transformative potential in healthcare. However, addressing challenges such as ethical risks, evaluation variability, and underrepresentation of critical specialties will be essential. Future efforts should prioritize standardized frameworks to ensure safe, effective, and equitable LLM integration in clinical practice.",Shool S; Adimi S; Saboori Amleshi R; Bitaraf E; Golpira R; Tara M 39591396,Evaluation of a Large Language Model on the American Academy of Pediatrics' PREP Emergency Medicine Question Bank.,2024,Pediatric emergency care,,,,"BACKGROUND: Large language models (LLMs), including ChatGPT (Chat Generative Pretrained Transformer), a popular, publicly available LLM, represent an important innovation in the application of artificial intelligence. These systems generate relevant content by identifying patterns in large text datasets based on user input across various topics. We sought to evaluate the performance of ChatGPT in practice test questions designed to assess knowledge competency for pediatric emergency medicine (PEM). METHODS: We evaluated the performance of ChatGPT for PEM board certification using a popular question bank used for board certification in PEM published between 2022 and 2024. Clinicians assessed performance of ChatGPT by inputting prompts and recording the software's responses, asking each question over 3 separate iterations. 
We calculated correct answer percentages (defined as correct in at least 2/3 iterations) and assessed for agreement between the iterations using Fleiss' kappa. RESULTS: We included 215 questions over the 3 study years. ChatGPT responded correctly to 161 of the PREP EM questions over 3 years (74.5%; 95% confidence interval, 68.5%-80.5%), which was similar within each study year (75.0%, 71.8%, and 77.8% for study years 2022, 2023, and 2024, respectively). Among correct responses, most were answered correctly on all 3 iterations (137/161, 85.1%). Performance varied by topic, with the highest scores in research and medical specialties and lower in procedures and toxicology. Fleiss' kappa across the 3 iterations was 0.71, indicating substantial agreement. CONCLUSION: ChatGPT provided correct answers to PEM responses in three-quarters of cases, over the recommended minimum of 65% provided by the question publisher for passing. Responses by ChatGPT included detailed explanations, suggesting potential use for medical education. We identified limitations in specific topics and image interpretation. These results demonstrate opportunities for LLMs to enhance both the education and clinical practice of PEM.",Ramgopal S; Varma S; Gorski JK; Kester KM; Shieh A; Suresh S 38051578,ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions.,2023,JMIR medical education,,,,"BACKGROUND: Large language models (LLMs), such as ChatGPT (Open AI), are increasingly used in medicine and supplement standard search engines as information sources. This leads to more ""consultations"" of LLMs about personal medical symptoms. OBJECTIVE: This study aims to evaluate ChatGPT's performance in answering clinical case-based questions in otorhinolaryngology (ORL) in comparison to ORL consultants' answers. METHODS: We used 41 case-based questions from established ORL study books and past German state examinations for doctors. 
The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified (in a blinded setting) if the answer was created by an ORL consultant or ChatGPT. Additionally, the character count was compared. Due to the rapidly evolving pace of technology, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was included to give an insight into the evolving potential of LLMs. RESULTS: Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT's scores were relatively higher in semantic categories (conciseness, coherence, and comprehensibility) compared to medical adequacy. ORL consultants identified ChatGPT as the source correctly in 98.4% (121/123) of cases. ChatGPT's answers had a significantly higher character count compared to ORL consultants (P<.001). Comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as a better coherence of the answers provided. Contrarily, neither the conciseness (P=.06) nor the comprehensibility (P=.08) improved significantly despite the significant increase in the mean amount of characters by 52.5% (n= (1470-964)/964; P<.001). CONCLUSIONS: While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower compared to ORL consultants' answers. 
LLMs have potential as augmentative tools for medical care, but their ""consultation"" for medical problems carries a high risk of misinformation as their high semantic quality may mask contextual deficits.",Buhr CR; Smith H; Huppertz T; Bahr-Hamm K; Matthias C; Blaikie A; Kelsey T; Kuhn S; Eckrich J 39019566,Development and evaluation of a large language model of ophthalmology in Chinese.,2024,The British journal of ophthalmology,,,,"BACKGROUND: Large language models (LLMs), such as ChatGPT, have considerable implications for various medical applications. However, ChatGPT's training primarily draws from English-centric internet data and is not tailored explicitly to the medical domain. Thus, an ophthalmic LLM in Chinese is clinically essential for both healthcare providers and patients in mainland China. METHODS: We developed an LLM of ophthalmology (MOPH) using Chinese corpora and evaluated its performance in three clinical scenarios: ophthalmic board exams in Chinese, answering evidence-based medicine-oriented ophthalmic questions and diagnostic accuracy for clinical vignettes. Additionally, we compared MOPH's performance to that of human doctors. RESULTS: In the ophthalmic exam, MOPH's average score closely aligned with the mean score of trainees (64.7 (range 62-68) vs 66.2 (range 50-92), p=0.817), but achieving a score above 60 in all seven mock exams. In answering ophthalmic questions, MOPH demonstrated an adherence of 83.3% (25/30) of responses following Chinese guidelines (Likert scale 4-5). Only 6.7% (2/30, Likert scale 1-2) and 10% (3/30, Likert scale 3) of responses were rated as 'poor or very poor' or 'potentially misinterpretable inaccuracies' by reviewers. In diagnostic accuracy, although the rate of correct diagnosis by ophthalmologists was superior to that by MOPH (96.1% vs 81.1%, p>0.05), the difference was not statistically significant. 
CONCLUSION: This study demonstrated the promising performance of MOPH, a Chinese-specific ophthalmic LLM, in diverse clinical scenarios. MOPH has potential real-world applications in Chinese-language ophthalmology settings.",Zheng C; Ye H; Guo J; Yang J; Fei P; Yuan Y; Huang D; Huang Y; Peng J; Xie X; Xie M; Zhao P; Chen L; Zhang M 37083633,Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care.,2023,JMIR medical education,,,,"BACKGROUND: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners. OBJECTIVE: Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium. METHODS: AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports from 2018 to 2022. Novel explanations from ChatGPT-defined as information provided that was not inputted within the question or multiple answer choices-were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT's strengths and weaknesses. 
RESULTS: Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT's performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman rho=-0.241 and -0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23). CONCLUSIONS: Large language models are approaching human expert-level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.",Thirunavukarasu AJ; Hassan R; Mahmood S; Sanghera R; Barzangi K; El Mukashfi M; Shah S 37337531,Evaluating the limits of AI in medical specialisation: ChatGPT's performance on the UK Neurology Specialty Certificate Examination.,2023,BMJ neurology open,,,,"BACKGROUND: Large language models such as ChatGPT have demonstrated potential as innovative tools for medical education and practice, with studies showing their ability to perform at or near the passing threshold in general medical examinations and standardised admission tests. However, no studies have assessed their performance in the UK medical education context, particularly at a specialty level, and specifically in the field of neurology and neuroscience. METHODS: We evaluated the performance of ChatGPT in higher specialty training for neurology and neuroscience using 69 questions from the Pool-Specialty Certificate Examination (SCE) Neurology Web Questions bank. The dataset primarily focused on neurology (80%). The questions spanned subtopics such as symptoms and signs, diagnosis, interpretation and management with some questions addressing specific patient populations. 
The performance of ChatGPT 3.5 Legacy, ChatGPT 3.5 Default and ChatGPT-4 models was evaluated and compared. RESULTS: ChatGPT 3.5 Legacy and ChatGPT 3.5 Default displayed overall accuracies of 42% and 57%, respectively, falling short of the passing threshold of 58% for the 2022 SCE neurology examination. ChatGPT-4, on the other hand, achieved the highest accuracy of 64%, surpassing the passing threshold and outperforming its predecessors across disciplines and subtopics. CONCLUSIONS: The advancements in ChatGPT-4's performance compared with its predecessors demonstrate the potential for artificial intelligence (AI) models in specialised medical education and practice. However, our findings also highlight the need for ongoing development and collaboration between AI developers and medical experts to ensure the models' relevance and reliability in the rapidly evolving field of medicine.",Giannos P 38261378,Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study.,2024,Journal of medical Internet research,,,,"BACKGROUND: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to ""hallucinations"" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. OBJECTIVE: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. METHODS: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. 
Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. RESULTS: GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the ""pass"" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the ""remember"" (29/68) and ""understand"" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. CONCLUSIONS: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.",Herrmann-Werner A; Festl-Wietek T; Holderried F; Herschbach L; Griewatz J; Masters K; Zipfel S; Mahling M 40053752,Detecting Artificial Intelligence-Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study.,2025,JMIR medical education,,,,"BACKGROUND: Large language models, exemplified by ChatGPT, have reached a level of sophistication that makes distinguishing between human- and artificial intelligence (AI)-generated texts increasingly challenging. 
This has raised concerns in academia, particularly in medicine, where the accuracy and authenticity of written work are paramount. OBJECTIVE: This semirandomized controlled study aims to examine the ability of 2 blinded expert groups with different levels of content familiarity-medical professionals and humanities scholars with expertise in textual analysis-to distinguish between longer scientific texts in German written by medical students and those generated by ChatGPT. Additionally, the study sought to analyze the reasoning behind their identification choices, particularly the role of content familiarity and linguistic features. METHODS: Between May and August 2023, a total of 35 experts (medical: n=22; humanities: n=13) were each presented with 2 pairs of texts on different medical topics. Each pair had similar content and structure: 1 text was written by a medical student, and the other was generated by ChatGPT (version 3.5, March 2023). Experts were asked to identify the AI-generated text and justify their choice. These justifications were analyzed through a multistage, interdisciplinary qualitative analysis to identify relevant textual features. Before unblinding, experts rated each text on 6 characteristics: linguistic fluency and spelling/grammatical accuracy, scientific quality, logical coherence, expression of knowledge limitations, formulation of future research questions, and citation quality. Univariate tests and multivariate logistic regression analyses were used to examine associations between participants' characteristics, their stated reasons for author identification, and the likelihood of correctly determining a text's authorship. RESULTS: Overall, in 48 out of 69 (70%) decision rounds, participants accurately identified the AI-generated texts, with minimal difference between groups (medical: 31/43, 72%; humanities: 17/26, 65%; odds ratio [OR] 1.37, 95% CI 0.5-3.9). 
While content errors had little impact on identification accuracy, stylistic features-particularly redundancy (OR 6.90, 95% CI 1.01-47.1), repetition (OR 8.05, 95% CI 1.25-51.7), and thread/coherence (OR 6.62, 95% CI 1.25-35.2)-played a crucial role in participants' decisions to identify a text as AI-generated. CONCLUSIONS: The findings suggest that both medical and humanities experts were able to identify ChatGPT-generated texts in medical contexts, with their decisions largely based on linguistic attributes. The accuracy of identification appears to be independent of experts' familiarity with the text content. As the decision-making process primarily relies on linguistic attributes-such as stylistic features and text coherence-further quasi-experimental studies using texts from other academic disciplines should be conducted to determine whether instructions based on these features can enhance lecturers' ability to distinguish between student-authored and AI-generated work.",Doru B; Maier C; Busse JS; Lucke T; Schonhoff J; Enax-Krumova E; Hessler S; Berger M; Tokic M 37099373,"Performance of ChatGPT on UK Standardized Admission Tests: Insights From the BMAT, TMUA, LNAT, and TSA Examinations.",2023,JMIR medical education,,,,"BACKGROUND: Large language models, such as ChatGPT by OpenAI, have demonstrated potential in various applications, including medical education. Previous studies have assessed ChatGPT's performance in university or professional settings. However, the model's potential in the context of standardized admission tests remains unexplored. OBJECTIVE: This study evaluated ChatGPT's performance on standardized admission tests in the United Kingdom, including the BioMedical Admissions Test (BMAT), Test of Mathematics for University Admission (TMUA), Law National Aptitude Test (LNAT), and Thinking Skills Assessment (TSA), to understand its potential as an innovative tool for education and test preparation. 
METHODS: Recent public resources (2019-2022) were used to compile a data set of 509 questions from the BMAT, TMUA, LNAT, and TSA covering diverse topics in aptitude, scientific knowledge and applications, mathematical thinking and reasoning, critical thinking, problem-solving, reading comprehension, and logical reasoning. This evaluation assessed ChatGPT's performance using the legacy GPT-3.5 model, focusing on multiple-choice questions for consistency. The model's performance was analyzed based on question difficulty, the proportion of correct responses when aggregating exams from all years, and a comparison of test scores between papers of the same exam using binomial distribution and paired-sample (2-tailed) t tests. RESULTS: The proportion of correct responses was significantly lower than incorrect ones in BMAT section 2 (P<.001) and TMUA paper 1 (P<.001) and paper 2 (P<.001). No significant differences were observed in BMAT section 1 (P=.2), TSA section 1 (P=.7), or LNAT papers 1 and 2, section A (P=.3). ChatGPT performed better in BMAT section 1 than section 2 (P=.047), with a maximum candidate ranking of 73% compared to a minimum of 1%. In the TMUA, it engaged with questions but had limited accuracy and no performance difference between papers (P=.6), with candidate rankings below 10%. In the LNAT, it demonstrated moderate success, especially in paper 2's questions; however, student performance data were unavailable. TSA performance varied across years with generally moderate results and fluctuating candidate rankings. Similar trends were observed for easy to moderate difficulty questions (BMAT section 1, P=.3; BMAT section 2, P=.04; TMUA paper 1, P<.001; TMUA paper 2, P=.003; TSA section 1, P=.8; and LNAT papers 1 and 2, section A, P>.99) and hard to challenging ones (BMAT section 1, P=.7; BMAT section 2, P<.001; TMUA paper 1, P=.007; TMUA paper 2, P<.001; TSA section 1, P=.3; and LNAT papers 1 and 2, section A, P=.2). 
CONCLUSIONS: ChatGPT shows promise as a supplementary tool for subject areas and test formats that assess aptitude, problem-solving and critical thinking, and reading comprehension. However, its limitations in areas such as scientific and mathematical knowledge and applications highlight the need for continuous development and integration with conventional learning strategies in order to fully harness its potential.",Giannos P; Delardas O 38153785,Differentiating ChatGPT-Generated and Human-Written Medical Texts: Quantitative Study.,2023,JMIR medical education,,,,"BACKGROUND: Large language models, such as ChatGPT, are capable of generating grammatically perfect and human-like text content, and a large number of ChatGPT-generated texts have appeared on the internet. However, medical texts, such as clinical notes and diagnoses, require rigorous validation, and erroneous medical content generated by ChatGPT could potentially lead to disinformation that poses significant harm to health care and the general public. OBJECTIVE: This study is among the first on responsible artificial intelligence-generated content in medicine. We focus on analyzing the differences between medical texts written by human experts and those generated by ChatGPT and designing machine learning workflows to effectively detect and differentiate medical texts generated by ChatGPT. METHODS: We first constructed a suite of data sets containing medical texts written by human experts and generated by ChatGPT. We analyzed the linguistic features of these 2 types of content and uncovered differences in vocabulary, parts-of-speech, dependency, sentiment, perplexity, and other aspects. Finally, we designed and implemented machine learning methods to detect medical text generated by ChatGPT. The data and code used in this paper are published on GitHub. 
RESULTS: Medical texts written by humans were more concrete, more diverse, and typically contained more useful information, while medical texts generated by ChatGPT paid more attention to fluency and logic and usually expressed general terminologies rather than effective information specific to the context of the problem. A bidirectional encoder representations from transformers-based model effectively detected medical texts generated by ChatGPT, and the F1 score exceeded 95%. CONCLUSIONS: Although text generated by ChatGPT is grammatically perfect and human-like, the linguistic characteristics of generated medical texts were different from those written by human experts. Medical text generated by ChatGPT could be effectively detected by the proposed machine learning algorithms. This study provides a pathway toward trustworthy and accountable use of large language models in medicine.",Liao W; Liu Z; Dai H; Xu S; Wu Z; Zhang Y; Huang X; Zhu D; Cai H; Li Q; Liu T; Li X 39322838,The Potential of Chat-Based Artificial Intelligence Models in Differentiating Between Keloid and Hypertrophic Scars: A Pilot Study.,2024,Aesthetic plastic surgery,,,,"BACKGROUND: Lasting scars such as keloids and hypertrophic scars adversely affect a patient's quality of life. However, these scars are frequently underdiagnosed because of the complexity of the current diagnostic criteria and classification systems. This study aimed to explore the application of Large Language Models (LLMs) such as ChatGPT in diagnosing scar conditions and to propose a more accessible and straightforward diagnostic approach. METHODS: In this study, five artificial intelligence (AI) chatbots, including ChatGPT-4 (GPT-4), Bing Chat (Precise, Balanced, and Creative modes), and Bard, were evaluated for their ability to interpret clinical scar images using a standardized set of prompts. Thirty mock images of various scar types were analyzed, and each chatbot was queried five times to assess the diagnostic accuracy.
RESULTS: GPT-4 had a significantly higher accuracy rate in diagnosing scars than Bing Chat. The overall accuracy rates of GPT-4 and Bing Chat were 36.0% and 22.0%, respectively (P = 0.027), with GPT-4 showing better performance in terms of specificity for keloids (0.6 vs. 0.006) and hypertrophic scars (0.72 vs. 0.0) than Bing Chat. CONCLUSIONS: Although currently available LLMs show potential for use in scar diagnostics, the current technology is still under development and is not yet sufficient for clinical application standards, highlighting the need for further advancements in AI for more accurate medical diagnostics. LEVEL OF EVIDENCE IV: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online instructions to authors www.springer.com/00266 .",Shiraishi M; Miyamoto S; Takeishi H; Kurita D; Furuse K; Ohba J; Moriwaki Y; Fujisawa K; Okazaki M 39307579,ChatGPT vs. sleep disorder specialist responses to common sleep queries: Ratings by experts and laypeople.,2024,Sleep health,,,,"BACKGROUND: Many individuals use the Internet, including generative artificial intelligence like ChatGPT, for sleep-related information before consulting medical professionals. This study compared responses from sleep disorder specialists and ChatGPT to common sleep queries, with experts and laypersons evaluating the responses' accuracy and clarity. METHODS: We assessed responses from sleep medicine specialists and ChatGPT-4 to 140 sleep-related questions from the Korean Sleep Research Society's website. In a blinded study design, sleep disorder experts and laypersons rated the medical helpfulness, emotional supportiveness, and sentence comprehensibility of the responses on a 1-5 scale. RESULTS: Laypersons rated ChatGPT higher for medical helpfulness (3.79 +/- 0.90 vs. 3.44 +/- 0.99, p < .001), emotional supportiveness (3.48 +/- 0.79 vs. 
3.12 +/- 0.98, p < .001), and sentence comprehensibility (4.24 +/- 0.79 vs. 4.14 +/- 0.96, p = .028). Experts also rated ChatGPT higher for emotional supportiveness (3.33 +/- 0.62 vs. 3.01 +/- 0.67, p < .001) but preferred specialists' responses for sentence comprehensibility (4.15 +/- 0.74 vs. 3.94 +/- 0.90, p < .001). When it comes to medical helpfulness, the experts rated the specialists' answers slightly higher than the laypersons did (3.70 +/- 0.84 vs. 3.63 +/- 0.87, p = .109). Experts slightly preferred specialist responses overall (56.0%), while laypersons favored ChatGPT (54.3%; p < .001). ChatGPT's responses were significantly longer (186.76 +/- 39.04 vs. 113.16 +/- 95.77 words, p < .001). DISCUSSION: Generative artificial intelligence like ChatGPT may help disseminate sleep-related medical information online. Laypersons appear to prefer ChatGPT's detailed, emotionally supportive responses over those from sleep disorder specialists.",Kim J; Lee SY; Kim JH; Shin DH; Oh EH; Kim JA; Cho JW 38602313,Importance of Patient History in Artificial Intelligence-Assisted Medical Diagnosis: Comparison Study.,2024,JMIR medical education,,,,"BACKGROUND: Medical history contributes approximately 80% to a diagnosis, although physical examinations and laboratory investigations increase a physician's confidence in the medical diagnosis. The concept of artificial intelligence (AI) was first proposed more than 70 years ago. Recently, its role in various fields of medicine has grown remarkably. However, no studies have evaluated the importance of patient history in AI-assisted medical diagnosis. OBJECTIVE: This study explored the contribution of patient history to AI-assisted medical diagnoses and assessed the accuracy of ChatGPT in reaching a clinical diagnosis based on the medical history provided. METHODS: Using clinical vignettes of 30 cases identified in The BMJ, we evaluated the accuracy of diagnoses generated by ChatGPT. 
We compared the diagnoses made by ChatGPT based solely on medical history with the correct diagnoses. We also compared the diagnoses made by ChatGPT after incorporating additional physical examination findings and laboratory data alongside history with the correct diagnoses. RESULTS: ChatGPT accurately diagnosed 76.6% (23/30) of the cases with only the medical history, consistent with previous research targeting physicians. We also found that this rate was 93.3% (28/30) when additional information was included. CONCLUSIONS: Although adding additional information improves diagnostic accuracy, patient history remains a significant factor in AI-assisted medical diagnosis. Thus, when using AI in medical diagnosis, it is crucial to include pertinent and correct patient histories for an accurate diagnosis. Our findings emphasize the continued significance of patient history in clinical diagnoses in this age and highlight the need for its integration into AI-assisted medical diagnosis systems.",Fukuzawa F; Yanagita Y; Yokokawa D; Uchida S; Yamashita S; Li Y; Shikino K; Tsukamoto T; Noda K; Uehara T; Ikusaka M 38526538,Performance of ChatGPT on the India Undergraduate Community Medicine Examination: Cross-Sectional Study.,2024,JMIR formative research,,,,"BACKGROUND: Medical students may increasingly use large language models (LLMs) in their learning. ChatGPT is an LLM at the forefront of this new development in medical education with the capacity to respond to multidisciplinary questions. OBJECTIVE: The aim of this study was to evaluate the ability of ChatGPT 3.5 to complete the Indian undergraduate medical examination in the subject of community medicine. We further compared ChatGPT scores with the scores obtained by the students. METHODS: The study was conducted at a publicly funded medical college in Hyderabad, India. 
The study was based on the internal assessment examination conducted in January 2023 for students in the Bachelor of Medicine and Bachelor of Surgery Final Year-Part I program; the examination of focus included 40 questions (divided between two papers) from the community medicine subject syllabus. Each paper had three sections with different weightage of marks for each section: section one had two long essay-type questions worth 15 marks each, section two had 8 short essay-type questions worth 5 marks each, and section three had 10 short-answer questions worth 3 marks each. The same questions were administered as prompts to ChatGPT 3.5 and the responses were recorded. Apart from scoring ChatGPT responses, two independent evaluators explored the responses to each question to further analyze their quality with regard to three subdomains: relevancy, coherence, and completeness. Each question was scored in these subdomains on a Likert scale of 1-5. The average of the two evaluators was taken as the subdomain score of the question. The proportion of questions with a score >50% of the maximum score (5) in each subdomain was calculated. RESULTS: ChatGPT 3.5 scored 72.3% on paper 1 and 61% on paper 2. The mean score of the 94 students was 43% on paper 1 and 45% on paper 2. The responses of ChatGPT 3.5 were also rated to be satisfactorily relevant, coherent, and complete for most of the questions (>80%). CONCLUSIONS: ChatGPT 3.5 appears to have substantial and sufficient knowledge to understand and answer the Indian medical undergraduate examination in the subject of community medicine. ChatGPT may be introduced to students to enable the self-directed learning of community medicine in pilot mode.
However, faculty oversight will be required as ChatGPT is still in the initial stages of development, and thus its potential and reliability of medical content from the Indian context need to be further explored comprehensively.",Gandhi AP; Joesph FK; Rajagopal V; Aparnavi P; Katkuri S; Dayama S; Satapathy P; Khatib MN; Gaidhane S; Zahiruddin QS; Behera A 39712564,Comparative evaluation of artificial intelligence systems' accuracy in providing medical drug dosages: A methodological study.,2024,World journal of methodology,,,,"BACKGROUND: Medication errors, especially in dosage calculation, pose risks in healthcare. Artificial intelligence (AI) systems like ChatGPT and Google Bard may help reduce errors, but their accuracy in providing medication information remains to be evaluated. AIM: To evaluate the accuracy of AI systems (ChatGPT 3.5, ChatGPT 4, Google Bard) in providing drug dosage information per Harrison's Principles of Internal Medicine. METHODS: A set of natural language queries mimicking real-world medical dosage inquiries was presented to the AI systems. Responses were analyzed using a 3-point Likert scale. The analysis, conducted with Python and its libraries, focused on basic statistics, overall system accuracy, and disease-specific and organ system accuracies. RESULTS: ChatGPT 4 outperformed the other systems, showing the highest rate of correct responses (83.77%) and the best overall weighted accuracy (0.6775). Disease-specific accuracy varied notably across systems, with some diseases being accurately recognized, while others demonstrated significant discrepancies. Organ system accuracy also showed variable results, underscoring system-specific strengths and weaknesses. CONCLUSION: ChatGPT 4 demonstrates superior reliability in medical dosage information, yet variations across diseases emphasize the need for ongoing improvements. 
These results highlight AI's potential in aiding healthcare professionals, urging continuous development for dependable accuracy in critical medical situations.",Ramasubramanian S; Balaji S; Kannan T; Jeyaraman N; Sharma S; Migliorini F; Balasubramaniam S; Jeyaraman M 39257533,Unlocking the potential of advanced large language models in medication review and reconciliation: A proof-of-concept investigation.,2024,Exploratory research in clinical and social pharmacy,,,,"BACKGROUND: Medication review and reconciliation is essential for optimizing drug therapy and minimizing medication errors. Large language models (LLMs) have been recently shown to possess a lot of potential applications in healthcare field due to their abilities of deductive, abductive, and logical reasoning. The present study assessed the abilities of LLMs in medication review and medication reconciliation processes. METHODS: Four LLMs were prompted with appropriate queries related to dosing regimen errors, drug-drug interactions, therapeutic drug monitoring, and genomics-based decision-making process. The veracity of the LLM outputs were verified from validated sources using pre-validated criteria (accuracy, relevancy, risk management, hallucination mitigation, and citations and guidelines). The impacts of the erroneous responses on the patients' safety were categorized either as major or minor. RESULTS: In the assessment of four LLMs regarding dosing regimen errors, drug-drug interactions, and suggestions for dosing regimen adjustments based on therapeutic drug monitoring and genomics-based individualization of drug therapy, responses were generally consistent across prompts with no clear pattern in response quality among the LLMs. For identification of dosage regimen errors, ChatGPT performed well overall, except for the query related to simvastatin. 
In terms of potential drug-drug interactions, all LLMs recognized interactions with warfarin but missed the interaction between metoprolol and verapamil. Regarding dosage modifications based on therapeutic drug monitoring, Claude-Instant provided appropriate suggestions for two scenarios and nearly appropriate suggestions for the other two. Similarly, for genomics-based decision-making, Claude-Instant offered satisfactory responses for four scenarios, followed by Gemini for three. Notably, Gemini stood out by providing references to guidelines or citations even without prompting, demonstrating a commitment to accuracy and reliability in its responses. Minor impacts were noted in identifying appropriate dosing regimens and therapeutic drug monitoring, while major impacts were found in identifying drug interactions and making pharmacogenomic-based therapeutic decisions. CONCLUSION: Advanced LLMs hold significant promise in revolutionizing the medication review and reconciliation process in healthcare. Diverse impacts on patient safety were observed. Integrating and validating LLMs within electronic health records and prescription systems is essential to harness their full potential and enhance patient safety and care quality.",Sridharan K; Sivaramakrishnan G 38287940,Evaluating machine learning-enabled and multimodal data-driven exercise prescriptions for mental health: a randomized controlled trial protocol.,2024,Frontiers in psychiatry,,,,"BACKGROUND: Mental illnesses represent a significant global health challenge, affecting millions with far-reaching social and economic impacts. Traditional exercise prescriptions for mental health often adopt a one-size-fits-all approach, which overlooks individual variations in mental and physical health. Recent advancements in artificial intelligence (AI) offer an opportunity to tailor these interventions more effectively. 
OBJECTIVE: This study aims to develop and evaluate a multimodal data-driven AI system for personalized exercise prescriptions, targeting individuals with mental illnesses. By leveraging AI, the study seeks to overcome the limitations of conventional exercise regimens and improve adherence and mental health outcomes. METHODS: The study is conducted in two phases. Initially, 1,000 participants will be recruited for AI model training and testing, with 800 forming the training set, augmented by 9,200 simulated samples generated by ChatGPT, and 200 as the testing set. Data annotation will be performed by experienced physicians from the Department of Mental Health at Guangdong Second Provincial General Hospital. Subsequently, a randomized controlled trial (RCT) with 40 participants will be conducted to compare the AI-driven exercise prescriptions against standard care. Assessments will be scheduled at 6, 12, and 18 months to evaluate cognitive, physical, and psychological outcomes. EXPECTED OUTCOMES: The AI-driven system is expected to demonstrate greater effectiveness in improving mental health outcomes compared to standard exercise prescriptions. Personalized exercise regimens, informed by comprehensive data analysis, are anticipated to enhance participant adherence and overall mental well-being. These outcomes could signify a paradigm shift in exercise prescription for mental health, paving the way for more personalized and effective treatment modalities. 
REGISTRATION AND ETHICAL APPROVAL: This study was approved by the Human Experimental Ethics Inspection of Guangzhou Sport University, and the registration is under review by ChiCTR.",Tan M; Xiao Y; Jing F; Xie Y; Lu S; Xiang M; Ren H 37629452,Comparing Meta-Analyses with ChatGPT in the Evaluation of the Effectiveness and Tolerance of Systemic Therapies in Moderate-to-Severe Plaque Psoriasis.,2023,Journal of clinical medicine,,,,"BACKGROUND: Meta-analyses (MAs) and network meta-analyses (NMAs) are high-quality studies for assessing drug efficacy, but they are time-consuming and may be affected by biases. The capacity of artificial intelligence to aggregate huge amounts of information is emerging as particularly interesting for processing the volume of information needed to generate MAs. In this study, we analyzed whether the chatbot ChatGPT is able to summarize information in a useful fashion for providers and patients in a way that matches up with the results of MAs/NMAs. METHODS: We included 16 studies (13 NMAs and 3 MAs) that evaluate biologics (n = 6) and both biologic and systemic treatment (n = 10) for moderate-to-severe psoriasis, published between January 2021 and May 2023. RESULTS: The conclusions of the MAs/NMAs were compared to ChatGPT's answers to queries about the molecules evaluated in the selected MAs/NMAs. The reproducibility between the results of ChatGPT and the MAs/NMAs was random regarding drug safety. Regarding efficacy, ChatGPT reached the same conclusion as 5 out of the 16 studies (four out of four studies when three molecules were compared), gave acceptable answers in 7 out of 16 studies, and was inconclusive in 4 out of 16 studies.
CONCLUSIONS: ChatGPT can generate conclusions that are similar to MAs when the efficacy of fewer drugs is compared but is still unable to summarize information in a way that matches up to the results of MAs/NMAs when more than three molecules are compared.",Lam Hoai XL; Simonart T 39184635,The Role of Artificial Intelligence in the Primary Prevention of Common Musculoskeletal Diseases.,2024,Cureus,,,,"BACKGROUND: Musculoskeletal disorders (MSDs) are a leading cause of disability worldwide, with a growing burden across all demographics. With advancements in technology, conversational artificial intelligence (AI) platforms such as ChatGPT (OpenAI, San Francisco, CA) have become instrumental in disseminating health information. This study evaluated the effectiveness of ChatGPT versions 3.5 and 4 in delivering primary prevention information for common MSDs, emphasizing that the study is focused on prevention and not on diagnosis. METHODS: This mixed-methods study employed the CLEAR tool to assess the quality of responses from ChatGPT versions in terms of completeness, lack of false information, evidence support, appropriateness, and relevance. Responses were evaluated independently by two expert raters in a blinded manner. Statistical analyses included Wilcoxon signed-rank tests and paired samples t-tests to compare the performance across versions. RESULTS: ChatGPT-3.5 and ChatGPT-4 effectively provided primary prevention information, with overall performance ranging from satisfactory to excellent. Responses for low back pain, fractures, knee osteoarthritis, neck pain, and gout received excellent scores from both versions. Additionally, ChatGPT-4 was better than ChatGPT-3.5 in terms of completeness (p = 0.015), appropriateness (p = 0.007), and relevance (p = 0.036), and ChatGPT-4 performed better across most medical conditions (p = 0.010). 
CONCLUSIONS: ChatGPT versions 3.5 and 4 are effective tools for disseminating primary prevention information for common MSDs, with ChatGPT-4 showing superior performance. This study underscores the potential of AI in enhancing public health strategies through reliable and accessible health communication. Advanced models such as ChatGPT-4 can effectively contribute to the primary prevention of MSDs by delivering high-quality health information, highlighting the role of AIs in addressing the global burden of chronic diseases. It is important to note that these AI tools are intended for preventive education purposes only and not for diagnostic use. Continuous improvements are necessary to fully harness the potential of AI in preventive medicine. Future studies should explore other AI platforms, languages, and secondary and tertiary prevention measures to maximize the utility of AIs in global health contexts.",Yilmaz Muluk S; Olcucu N 36909565,Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model.,2023,Research square,,,,"BACKGROUND: Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT for medical queries is not known. METHODS: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes/no) or descriptive answers. The physicians then graded ChatGPT-generated answers to these questions for accuracy (6-point Likert scale; range 1 - completely incorrect to 6 - completely correct) and completeness (3-point Likert scale; range 1 - incomplete to 3 - complete plus additional context). Scores were summarized with descriptive statistics and compared using Mann-Whitney U or Kruskal-Wallis testing. 
RESULTS: Across all questions (n=284), median accuracy score was 5.5 (between almost completely and completely correct) with mean score of 4.8 (between mostly and almost completely correct). Median completeness score was 3 (complete and comprehensive) with mean score of 2.5. For questions rated easy, medium, and hard, median accuracy scores were 6, 5.5, and 5 (mean 5.0, 4.7, and 4.6; p=0.05). Accuracy scores for binary and descriptive questions were similar (median 6 vs. 5; mean 4.9 vs. 4.7; p=0.07). Of 36 questions with scores of 1-2, 34 were re-queried/re-graded 8-17 days later with substantial improvement (median 2 vs. 4; p<0.01). CONCLUSIONS: ChatGPT generated largely accurate information to diverse medical queries as judged by academic physician specialists although with important limitations. Further research and model development are needed to correct inaccuracies and for validation.",Johnson D; Goodman R; Patrinely J; Stone C; Zimmerman E; Donald R; Chang S; Berkowitz S; Finn A; Jahangir E; Scoville E; Reese T; Friedman D; Bastarache J; van der Heijden Y; Wright J; Carter N; Alexander M; Choe J; Chastain C; Zic J; Horst S; Turker I; Agarwal R; Osmundson E; Idrees K; Kieman C; Padmanabhan C; Bailey C; Schlegel C; Chambless L; Gibson M; Osterman T; Wheless L 39156049,Evaluating cognitive performance: Traditional methods vs. ChatGPT.,2024,Digital health,,,,"BACKGROUND: NLP models like ChatGPT promise to revolutionize text-based content delivery, particularly in medicine. Yet, doubts remain about ChatGPT's ability to reliably support evaluations of cognitive performance, warranting further investigation into its accuracy and comprehensiveness in this area. METHOD: A cohort of 60 cognitively normal individuals and 30 stroke survivors underwent a comprehensive evaluation, covering memory, numerical processing, verbal fluency, and abstract thinking. Healthcare professionals and NLP models GPT-3.5 and GPT-4 conducted evaluations following established standards. 
Scores were compared, and efforts were made to refine scoring protocols and interaction methods to enhance ChatGPT's potential in these evaluations. RESULT: Within the cohort of healthy participants, the utilization of GPT-3.5 revealed significant disparities in memory evaluation compared to both physician-led assessments and those conducted utilizing GPT-4 (P < 0.001). Furthermore, within the domain of memory evaluation, GPT-3.5 exhibited discrepancies in 8 out of 21 specific measures when compared to assessments conducted by physicians (P < 0.05). Additionally, GPT-3.5 demonstrated statistically significant deviations from physician assessments in speech evaluation (P = 0.009). Among participants with a history of stroke, GPT-3.5 exhibited differences solely in verbal assessment compared to physician-led evaluations (P = 0.002). Notably, through the implementation of optimized scoring methodologies and refinement of interaction protocols, partial mitigation of these disparities was achieved. CONCLUSION: ChatGPT can produce evaluation outcomes comparable to traditional methods. Despite differences from physician evaluations, refinement of scoring algorithms and interaction protocols has improved alignment. ChatGPT performs well even in populations with specific conditions like stroke, suggesting its versatility. GPT-4 yields results closer to physician ratings, indicating potential for further enhancement. These findings highlight ChatGPT's importance as a supplementary tool, offering new avenues for information gathering in medical fields and guiding its ongoing development and application.",Fei X; Tang Y; Zhang J; Zhou Z; Yamamoto I; Zhang Y 39305476,Current application of ChatGPT in undergraduate nuclear medicine education: Taking Chongqing Medical University as an example.,2025,Medical teacher,,,,"BACKGROUND: Nuclear Medicine (NM), as an inherently interdisciplinary field, integrates diverse scientific principles and advanced imaging techniques.
The advent of ChatGPT, a large language model, opens new avenues for medical educational innovation. With its advanced natural language processing abilities and complex algorithms, ChatGPT holds the potential to substantially enrich medical education, particularly in NM. OBJECTIVE: To investigate the current application of ChatGPT in undergraduate Nuclear Medicine Education (NME). METHODS: Employing a mixed-methods sequential explanatory design, the research investigates the current status of NME, the use of ChatGPT and the attitude towards ChatGPT among teachers and students in the Second Clinical College of Chongqing Medical University. RESULTS: The investigation yields several salient findings: (1) Students and educators in NM face numerous challenges in the learning process; (2) ChatGPT is found to possess significant applicability and potential benefits in NME; (3) There is a pronounced inclination among respondents to adopt ChatGPT, with a keen interest in its diverse applications within the educational sphere. CONCLUSION: ChatGPT has been utilized to address the difficulties faced by undergraduates at Chongqing Medical University in NME, and has been applied in various aspects to assist learning. The findings of this survey may offer some insights into how ChatGPT can be integrated into practical medical education.",Deng A; Chen W; Dai J; Jiang L; Chen Y; Chen Y; Jiang J; Rao M 40139476,Can a Large Language Model Interpret Data in the Electronic Health Record to Infer Minimum Clinically Important Difference Achievement of Knee Osteoarthritis Outcome Score-Joint Replacement Score Following Total Knee Arthroplasty?,2025,The Journal of arthroplasty,,,,"BACKGROUND: Obtaining total knee arthroplasty patient-reported outcomes for quality assessment is costly and difficult.
We asked whether a large language model (LLM) could interpret electronic health record notes to differentiate patients attaining a 1-year minimum clinically important difference (MCID) for the Knee Osteoarthritis Outcome Score-Joint Replacement (KOOS-JR) from those who did not. We also investigated whether sufficient information to infer MCID achievement exists in the chart by having a blinded orthopaedic surgeon make the same determination. METHODS: In this retrospective case-control study, we selected 40 total knee arthroplasty patients who achieved 1-year KOOS-JR MCID and 40 who did not. Orthopaedic, emergency medicine, and primary care notes from zero to six months preoperatively and nine to 15 months postoperatively were deidentified. ChatGPT 3.5 (ChatGPT) interpreted these notes to determine whether the patient improved after surgery. A blinded orthopaedic surgeon classified these patients using all chart information. The sensitivity, specificity, and accuracy of ChatGPT and the surgeon's responses were calculated. RESULTS: ChatGPT classified 78 of 80 cases with 97% sensitivity, but only 33% specificity. The surgeon's assessment had 90% sensitivity and 63% specificity. Given the equal distribution of patients meeting or not meeting MCID, ChatGPT's accuracy was 65%. The surgeon's was 76%. CONCLUSIONS: ChatGPT's assessment of KOOS-JR MCID attainment had 97% sensitivity, but only 33% specificity. False positives were commonly due to the LLM not having access to, or not properly interpreting, signs of problems in the chart. This was an initial evaluation of the current ability of a general-purpose LLM to evaluate patient outcomes based on information in chart notes. An orthopaedic surgeon's assessment of the full chart suggests an opportunity to improve on this baseline performance, possibly enabling quality monitoring and identification of best practices across a large health care system.
Additional work is needed to optimize model performance and confirm the utility of this approach.",Zalikha AK; Hong TS; Small EA; Constant M; Harris AHS; Giori NJ 39635018,Application of ChatGPT-4 to oculomics: a cost-effective osteoporosis risk assessment to enhance management as a proof-of-principles model in 3PM.,2024,The EPMA journal,,,,"BACKGROUND: Oculomics is an emerging medical field that focuses on the study of the eye to detect and understand systemic diseases. ChatGPT-4 is a highly advanced AI model with multimodal capabilities, allowing it to process text and statistical data. Osteoporosis is a chronic condition presenting asymptomatically but leading to fractures if untreated. Current diagnostic methods like dual X-ray absorptiometry (DXA) are costly and involve radiation exposure. This study aims to develop a cost-effective osteoporosis risk prediction tool using ophthalmological data and ChatGPT-4 based on oculomics, aligning with predictive, preventive, and personalized medicine (3PM) principles. WORKING HYPOTHESIS AND METHODS: We hypothesize that leveraging ophthalmological data (oculomics) combined with AI-driven regression models developed by ChatGPT-4 can significantly improve the predictive accuracy for osteoporosis risk. This integration will facilitate earlier detection, enable more effective preventive strategies, and support personalized treatment plans tailored to individual patients. We utilized DXA and ophthalmological data from the Korea National Health and Nutrition Examination Survey to develop and validate osteopenia and osteoporosis prediction models. Ophthalmological and demographic data were integrated into logistic regression analyses, facilitated by ChatGPT-4, to create prediction formulas. These models were then converted into calculator software through automated coding by ChatGPT-4. 
RESULTS: ChatGPT-4 automatically developed prediction models based on key predictors of osteoporosis and osteopenia, including age, gender, weight, and specific ophthalmological conditions such as cataracts and early age-related macular degeneration, and successfully implemented a risk calculator tool. The oculomics-based models outperformed traditional methods, with area under the curve of the receiver operating characteristic values of 0.785 for osteopenia and 0.866 for osteoporosis in the validation set. The calculator demonstrated high sensitivity and specificity, providing a reliable tool for early osteoporosis screening. CONCLUSIONS AND EXPERT RECOMMENDATIONS IN THE FRAMEWORK OF 3PM: This study illustrates the value of integrating ophthalmological data into multi-level diagnostics for osteoporosis, significantly improving the accuracy of health risk assessment and the identification of at-risk individuals. Aligned with the principles of 3PM, this approach fosters earlier detection and enables the development of individualized patient profiles, facilitating personalized and targeted treatment strategies. This study also highlights the potential of AI, specifically ChatGPT-4, in developing accessible, cost-effective, and radiation-free screening tools for advancing 3PM in clinical practice. Our findings emphasize the importance of a holistic approach, incorporating comprehensive health indices and interdisciplinary collaboration, to deliver personalized management plans. Preventive strategies should focus on lifestyle modifications and targeted interventions to enhance bone health, thereby preventing the progression of osteoporosis and contributing to overall patient well-being.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s13167-024-00378-0.",Choi JY; Han E; Yoo TK 38976865,ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis.,2024,Journal of medical Internet research,,,,"BACKGROUND: OpenAI's ChatGPT is a pioneering artificial intelligence (AI) in the field of natural language processing, and it holds significant potential in medicine for providing treatment advice. Additionally, recent studies have demonstrated promising results using ChatGPT for emergency medicine triage. However, its diagnostic accuracy in the emergency department (ED) has not yet been evaluated. OBJECTIVE: This study compares the diagnostic accuracy of ChatGPT with GPT-3.5 and GPT-4 and primary treating resident physicians in an ED setting. METHODS: Among 100 adults admitted to our ED in January 2023 with internal medicine issues, the diagnostic accuracy was assessed by comparing the diagnoses made by ED resident physicians and those made by ChatGPT with GPT-3.5 or GPT-4 against the final hospital discharge diagnosis, using a point system for grading accuracy. RESULTS: The study enrolled 100 patients with a median age of 72 (IQR 58.5-82.0) years who were admitted to our internal medicine ED primarily for cardiovascular, endocrine, gastrointestinal, or infectious diseases. GPT-4 outperformed both GPT-3.5 (P<.001) and ED resident physicians (P=.01) in diagnostic accuracy for internal medicine emergencies. Furthermore, across various disease subgroups, GPT-4 consistently outperformed GPT-3.5 and resident physicians. It demonstrated significant superiority in cardiovascular (GPT-4 vs ED physicians: P=.03) and endocrine or gastrointestinal diseases (GPT-4 vs GPT-3.5: P=.01). However, in other categories, the differences were not statistically significant. 
CONCLUSIONS: In this study, which compared the diagnostic accuracy of GPT-3.5, GPT-4, and ED resident physicians against a discharge diagnosis gold standard, GPT-4 outperformed both the resident physicians and its predecessor, GPT-3.5. Despite the retrospective design of the study and its limited sample size, the results underscore the potential of AI as a supportive diagnostic tool in ED settings.",Hoppe JM; Auer MK; Struven A; Massberg S; Stremmel C 39064053,Assessing the Accuracy of Artificial Intelligence Models in Scoliosis Classification and Suggested Therapeutic Approaches.,2024,Journal of clinical medicine,,,,"Background: Open-source artificial intelligence models (OSAIMs) are increasingly being applied in various fields, including IT and medicine, offering promising solutions for diagnostic and therapeutic interventions. In response to the growing interest in AI for clinical diagnostics, we evaluated several OSAIMs (ChatGPT 4, Microsoft Copilot, Gemini, PopAi, You Chat, Claude, and the specialized PMC-LLaMA 13B), assessing their abilities to classify scoliosis severity and recommend treatments based on radiological descriptions from AP radiographs. Methods: Our study employed a two-stage methodology, where descriptions of single-curve scoliosis were analyzed by AI models following their evaluation by two independent neurosurgeons. Statistical analysis involved the Shapiro-Wilk test for normality, with non-normal distributions described using medians and interquartile ranges. Inter-rater reliability was assessed using Fleiss' kappa, and performance metrics, like accuracy, sensitivity, specificity, and F1 scores, were used to evaluate the AI systems' classification accuracy. Results: The analysis indicated that although some AI systems, like ChatGPT 4, Copilot, and PopAi, accurately reflected the recommended Cobb angle ranges for disease severity and treatment, others, such as Gemini and Claude, required further calibration.
Particularly, PMC-LLaMA 13B expanded the classification range for moderate scoliosis, potentially influencing clinical decisions and delaying interventions. Conclusions: These findings highlight the need for the continuous refinement of AI models to enhance their clinical applicability.",Fabijan A; Zawadzka-Fabijan A; Fabijan R; Zakrzewski K; Nowoslawska E; Polis B 37987870,"""ChatGPT, Can You Help Me Save My Child's Life?"" - Diagnostic Accuracy and Supportive Capabilities to Lay Rescuers by ChatGPT in Prehospital Basic Life Support and Paediatric Advanced Life Support Cases - An In-silico Analysis.",2023,Journal of medical systems,,,,"BACKGROUND: Paediatric emergencies are challenging for healthcare workers, first aiders, and parents waiting for emergency medical services to arrive. With the expected rise of virtual assistants, people will likely seek help from such digital AI tools, especially in regions lacking emergency medical services. Large Language Models like ChatGPT proved effective in providing health-related information and are competent in medical exams but are questioned regarding patient safety. Currently, there is no information on ChatGPT's performance in supporting parents in paediatric emergencies requiring help from emergency medical services. This study aimed to test 20 paediatric and two basic life support case vignettes for ChatGPT and GPT-4 performance and safety in children. METHODS: We provided the cases three times each to two models, ChatGPT and GPT-4, and assessed the diagnostic accuracy, emergency call advice, and the validity of advice given to parents. RESULTS: Both models recognized the emergency in the cases, except for septic shock and pulmonary embolism, and identified the correct diagnosis in 94%. 
However, ChatGPT/GPT-4 reliably advised to call emergency services only in 12 of 22 cases (54%), gave correct first aid instructions in 9 cases (45%) and incorrectly advised advanced life support techniques to parents in 3 of 22 cases (13.6%). CONCLUSION: Considering these results of the recent ChatGPT versions, the validity, reliability and thus safety of ChatGPT/GPT-4 as an emergency support tool is questionable. However, whether humans would perform better in the same situation is uncertain. Moreover, other studies have shown that human emergency call operators are also inaccurate, partly with worse performance than ChatGPT/GPT-4 in our study. However, one of the main limitations of the study is that we used prototypical cases, and the management may differ from urban to rural areas and between different countries, indicating the need for further evaluation of the context sensitivity and adaptability of the model. Nevertheless, ChatGPT and the new versions under development may be promising tools for assisting lay first responders, operators, and professionals in diagnosing a paediatric emergency. TRIAL REGISTRATION: Not applicable.",Bushuven S; Bentele M; Bentele S; Gerber B; Bansbach J; Ganter J; Trifunovic-Koenig M; Ranisch R 39445873,Comparing Provider and ChatGPT Responses to Breast Reconstruction Patient Questions in the Electronic Health Record.,2024,Annals of plastic surgery,,,,"BACKGROUND: Patient-directed Electronic Health Record (EHR) messaging is used as an adjunct to enhance patient-physician interactions but further burdens the physician. There is a need for clear electronic patient communication in all aspects of medicine, including plastic surgery. We can potentially utilize innovative communication tools like ChatGPT. This study assesses ChatGPT's effectiveness in answering breast reconstruction queries, comparing its accuracy, empathy, and readability with healthcare providers' responses. 
METHODS: Ten deidentified questions regarding breast reconstruction were extracted from electronic messages. They were presented to ChatGPT3, ChatGPT4, plastic surgeons, and advanced practice providers for response. ChatGPT3 and ChatGPT4 were also prompted to give brief responses. Using 1-5 Likert scoring, accuracy and empathy were graded by 2 plastic surgeons and medical students, respectively. Readability was measured using Flesch Reading Ease. Grades were compared using 2-tailed t tests. RESULTS: Combined provider responses had better Flesch Reading Ease scores compared to all combined chatbot responses (53.3 +/- 13.3 vs 36.0 +/- 11.6, P < 0.001) and combined brief chatbot responses (53.3 +/- 13.3 vs 34.7 +/- 12.8, P < 0.001). Empathy scores were higher in all combined chatbot than in those from combined providers (2.9 +/- 0.8 vs 2.0 +/- 0.9, P < 0.001). There were no statistically significant differences in accuracy between combined providers and all combined chatbot responses (4.3 +/- 0.9 vs 4.5 +/- 0.6, P = 0.170) or combined brief chatbot responses (4.3 +/- 0.9 vs 4.6 +/- 0.6, P = 0.128). CONCLUSIONS: Amid the time constraints and complexities of plastic surgery decision making, our study underscores ChatGPT's potential to enhance patient communication. ChatGPT excels in empathy and accuracy, yet its readability presents limitations that should be addressed.",Soroudi D; Gozali A; Knox JA; Parmeshwar N; Sadjadi R; Wilson JC; Lee SA; Piper ML 39896176,Exploring potential drug-drug interactions in discharge prescriptions: ChatGPT's effectiveness in assessing those interactions.,2025,Exploratory research in clinical and social pharmacy,,,,"BACKGROUND: Potential drug-drug interactions (pDDIs) pose substantial risks in clinical practice, leading to increased morbidity, mortality, and healthcare costs. 
Tools like Micromedex drug-drug interaction checker are commonly used to screen for pDDIs, yet emerging AI models, such as ChatGPT, offer the potential for supplementary pDDI prediction. However, the accuracy and reliability of these AI tools in a clinical context remain largely untested. OBJECTIVE: This study evaluates pDDIs in discharge prescriptions for medical ward patients and assesses ChatGPT-4.0's effectiveness in predicting these interactions compared to Micromedex drug-drug interaction checker. METHOD: A cross-sectional study was conducted over three months with 301 discharged patients. pDDIs were identified using Micromedex drug-drug interaction checker, detailing each interaction's occurrence, severity, onset, and documentation. ChatGPT-4.0 predictions were then analyzed against Micromedex data. Binary logistic regression analysis was applied to assess the influence of predictor variables in the occurrence of pDDIs. RESULTS: 1551 drugs were prescribed to 301 patients, averaging 5.15 per patient. pDDIs were detected in 60.13 % of patients, averaging 3.17 pDDIs per patient, ChatGPT-4.0 accurately identified pDDIs (100 % for occurrence) but had limited accuracy for severity (37.3 %) and moderate accuracy for onset (65.2 %). The most frequent major interaction was between Cefuroxime Axetil and Pantoprazole Sodium. Polypharmacy significantly increased the risk of pDDIs (OR: 3.960, p < 0.001). CONCLUSION: pDDIs are prevalent in internal medicine discharge prescriptions, with polypharmacy heightening the risk. 
While ChatGPT-4.0 accurately identifies pDDI occurrence, its limitations in predicting severity, onset, and documentation underscore the need for careful oversight by healthcare professionals.",Thapa RB; Karki S; Shrestha S 39864953,Classifying Unstructured Text in Electronic Health Records for Mental Health Prediction Models: Large Language Model Evaluation Study.,2025,JMIR medical informatics,,,,"BACKGROUND: Prediction models have demonstrated a range of applications across medicine, including using electronic health record (EHR) data to identify hospital readmission and mortality risk. Large language models (LLMs) can transform unstructured EHR text into structured features, which can then be integrated into statistical prediction models, ensuring that the results are both clinically meaningful and interpretable. OBJECTIVE: This study aims to compare the classification decisions made by clinical experts with those generated by a state-of-the-art LLM, using terms extracted from a large EHR data set of individuals with mental health disorders seen in emergency departments (EDs). METHODS: Using a dataset from the EHR systems of more than 50 health care provider organizations in the United States from 2016 to 2021, we extracted all clinical terms that appeared in at least 1000 records of individuals admitted to the ED for a mental health-related problem from a source population of over 6 million ED episodes. Two experienced mental health clinicians (one medically trained psychiatrist and one clinical psychologist) reached consensus on the classification of EHR terms and diagnostic codes into categories. We evaluated an LLM's agreement with clinical judgment across three classification tasks as follows: (1) classify terms into ""mental health"" or ""physical health"", (2) classify mental health terms into 1 of 42 prespecified categories, and (3) classify physical health terms into 1 of 19 prespecified broad categories.
RESULTS: There was high agreement between the LLM and clinical experts when categorizing 4553 terms as ""mental health"" or ""physical health"" (kappa=0.77, 95% CI 0.75-0.80). However, there was still considerable variability in LLM-clinician agreement on the classification of mental health terms (kappa=0.62, 95% CI 0.59-0.66) and physical health terms (kappa=0.69, 95% CI 0.67-0.70). CONCLUSIONS: The LLM displayed high agreement with clinical experts when classifying EHR terms into certain mental health or physical health term categories. However, agreement with clinical experts varied considerably within both sets of mental and physical health term categories. Importantly, the use of LLMs presents an alternative to manual human coding, presenting great potential to create interpretable features for prediction models.",Cardamone NC; Olfson M; Schmutte T; Ungar L; Liu T; Cullen SW; Williams NJ; Marcus SC 38470459,Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study.,2024,JMIR medical education,,,,"BACKGROUND: Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images. OBJECTIVE: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance to answer questions in the 117th Japanese National Medical Licensing Examination. METHODS: We focused on 108 questions that had 1 or more images as part of a question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test. 
RESULTS: Among the 108 questions with images, GPT-4V's accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. CONCLUSIONS: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination.",Nakao T; Miki S; Nakamura Y; Kikuchi T; Nomura Y; Hanaoka S; Yoshikawa T; Abe O 40164490,Evaluating ChatGPT for converting clinic letters into patient-friendly language.,2025,BJGP open,,,,"BACKGROUND: Previous research has shown that communication with patients in language they understand leads to greater comprehension of treatment and diagnoses but can be time-consuming for clinicians. AIM: Here we sought to investigate the utility of ChatGPT to translate clinic letters into language patients understood, without loss of clinical information, and to assess what impact this had on patients' understanding of letter content. DESIGN & SETTING: Single-blinded quantitative study using objective and subjective analysis of language complexity. METHOD: Twenty-three clinic letters were provided by consultants across 8 specialties. Letters were inputted into ChatGPT with a prompt aimed at improving understanding for patients. Patient representatives were then asked to rate their understanding of the content of letters. RESULTS: Translation of letters by ChatGPT resulted in no loss of clinical information, but did result in a significant increase in understanding and satisfaction, and a decrease in the need to obtain medical help to translate the letter contents, among patient representatives compared with clinician-written originals.
CONCLUSION: Overall, we conclude that ChatGPT can be used to translate clinic letters into patient friendly language without loss of clinical content, and that these letters are preferred by patients.",Cork S; Hopcroft K 39255030,Prompt Engineering Paradigms for Medical Applications: Scoping Review.,2024,Journal of medical Internet research,,,,"BACKGROUND: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. OBJECTIVE: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. METHODS: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). 
RESULTS: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering-specific information reported across papers and find that many studies neglect to explicitly mention them, posing a challenge for advancing prompt engineering research. CONCLUSIONS: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field.",Zaghir J; Naguib M; Bjelogrlic M; Neveol A; Tannier X; Lovis C 38239905,ChatGPT is not ready yet for use in providing mental health assessment and interventions.,2023,Frontiers in psychiatry,,,,"BACKGROUND: Psychiatry is a specialized field of medicine that focuses on the diagnosis, treatment, and prevention of mental health disorders. With advancements in technology and the rise of artificial intelligence (AI), there has been a growing interest in exploring the potential of AI language models systems, such as Chat Generative Pre-training Transformer (ChatGPT), to assist in the field of psychiatry. 
OBJECTIVE: Our study aimed to evaluate the effectiveness, reliability, and safety of ChatGPT in assisting patients with mental health problems, and to assess its potential as a collaborative tool for mental health professionals through a simulated interaction with three distinct imaginary patients. METHODS: Three imaginary patient scenarios (cases A, B, and C) were created, representing different mental health problems. All three patients present with, and seek to eliminate, the same chief complaint (i.e., difficulty falling asleep and waking up frequently during the night in the last 2 weeks). ChatGPT was engaged as a virtual psychiatric assistant to provide responses and treatment recommendations. RESULTS: In case A, the recommendations were relatively appropriate (albeit non-specific), and could potentially be beneficial for both users and clinicians. However, as the complexity of the clinical cases increased (cases B and C), the information and recommendations generated by ChatGPT became inappropriate, even dangerous; and the limitations of the program became more glaring. The main strengths of ChatGPT lie in its ability to provide quick responses to user queries and to simulate empathy. One notable limitation is ChatGPT's inability to interact with users to collect further information relevant to the diagnosis and management of a patient's clinical condition. Another serious limitation is ChatGPT's inability to use critical thinking and clinical judgment to drive a patient's management. CONCLUSION: As of July 2023, ChatGPT failed to give simple medical advice in certain clinical scenarios. This suggests that the quality of ChatGPT-generated content is still far from being a guide for users and professionals to provide accurate mental health information.
It remains, therefore, premature to conclude on the usefulness and safety of ChatGPT in mental health practice.",Dergaa I; Fekih-Romdhane F; Hallit S; Loch AA; Glenn JM; Fessi MS; Ben Aissa M; Souissi N; Guelmami N; Swed S; El Omri A; Bragazzi NL; Ben Saad H 37040823,Using a Google Web Search Analysis to Assess the Utility of ChatGPT in Total Joint Arthroplasty.,2023,The Journal of arthroplasty,,,,"BACKGROUND: Rapid technological advancements have laid the foundations for the use of artificial intelligence in medicine. The promise of machine learning (ML) lies in its potential ability to improve treatment decision making, predict adverse outcomes, and streamline the management of perioperative healthcare. In an increasing consumer-focused health care model, unprecedented access to information may extend to patients using ChatGPT to gain insight into medical questions. The main objective of our study was to replicate a patient's internet search in order to assess the appropriateness of ChatGPT, a novel machine learning tool released in 2022 that provides dialogue responses to queries, in comparison to Google Web Search, the most widely used search engine in the United States today, as a resource for patients for online health information. For the 2 different search engines, we compared i) the most frequently asked questions (FAQs) associated with total knee arthroplasty (TKA) and total hip arthroplasty (THA) by question type and topic; ii) the answers to the most frequently asked questions; as well as iii) the FAQs yielding a numerical response. METHODS: A Google web search was performed with the following search terms: ""total knee replacement"" and ""total hip replacement."" These terms were individually entered and the first 10 FAQs were extracted along with the source of the associated website for each question. 
The following statements were inputted into ChatGPT: 1) ""Perform a google search with the search term 'total knee replacement' and record the 10 most FAQs related to the search term"" as well as 2) ""Perform a google search with the search term 'total hip replacement' and record the 10 most FAQs related to the search term."" A Google web search was repeated with the same search terms to identify the first 10 FAQs that included a numerical response for both ""total knee replacement"" and ""total hip replacement."" These questions were then inputted into ChatGPT and the questions and answers were recorded. RESULTS: There were 5 of 20 (25%) questions that were similar when performing a Google web search and a search of ChatGPT for all search terms. Of the 20 questions asked for the Google Web Search, 13 of 20 were provided by commercial websites. For ChatGPT, 15 of 20 (75%) questions were answered by government websites, with the most frequent one being PubMed. In terms of numerical questions, 11 of 20 (55%) of the most FAQs provided different responses between a Google web search and ChatGPT. CONCLUSION: A comparison of the FAQs by a Google web search with attempted replication by ChatGPT revealed heterogenous questions and responses for open and discrete questions. ChatGPT should remain a trending use as a potential resource to patients that needs further corroboration until its ability to provide credible information is verified and concordant with the goals of the physician and the patient alike.",Dubin JA; Bains SS; Chen Z; Hameed D; Nace J; Mont MA; Delanois RE 39230947,Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study.,2024,JMIR medical informatics,,,,"BACKGROUND: Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed. 
OBJECTIVE: This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding medical research papers using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational studies. METHODS: This is a methodological study evaluating the capabilities of new generative artificial intelligence tools in understanding medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed, comparing the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed LLMs' understanding of different sections of a research paper. RESULTS: LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). Statistical analysis revealed statistically significant differences between LLMs (P<.001), with older models showing inconsistent performance compared to newer versions. LLMs showcased distinct performances for each question across different parts of a scholarly paper, with certain models like PaLM 2 and GPT-3.5 showing remarkable versatility and depth in understanding. CONCLUSIONS: This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval-augmented generation method. The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making.
Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLM models.",Akyon SH; Akyon FC; Camyar AS; Hizli F; Sari T; Hizli S 38148925,"The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: An analysis of ChatGPT 3.5, ChatGPT 4, and Bard.",2024,Journal of orthopaedics,,,,"BACKGROUND: Recent advancements in artificial intelligence (AI) have sparked interest in its integration into clinical medicine and education. This study evaluates the performance of three AI tools compared to human clinicians in addressing complex orthopaedic decisions in real-world clinical cases. QUESTIONS/PURPOSES: To evaluate the ability of commonly used AI tools to formulate orthopaedic clinical decisions in comparison to human clinicians. PATIENTS AND METHODS: The study used OrthoBullets Cases, a publicly available clinical cases collaboration platform where surgeons from around the world choose treatment options based on peer-reviewed standardised treatment polls. The clinical cases cover various orthopaedic categories. Three AI tools, (ChatGPT 3.5, ChatGPT 4, and Bard), were evaluated. Uniform prompts were used to input case information including questions relating to the case, and the AI tools' responses were analysed for alignment with the most popular response, within 10%, and within 20% of the most popular human responses. RESULTS: In total, 8 clinical categories comprising of 97 questions were analysed. ChatGPT 4 demonstrated the highest proportion of most popular responses (proportion of most popular response: ChatGPT 4 68.0%, ChatGPT 3.5 40.2%, Bard 45.4%, P value < 0.001), outperforming other AI tools. AI tools performed poorer in questions that were considered controversial (where disagreement occurred in human responses). Inter-tool agreement, as evaluated using Cohen's kappa coefficient, ranged from 0.201 (ChatGPT 4 vs. 
Bard) to 0.634 (ChatGPT 3.5 vs. Bard). However, AI tool responses varied widely, reflecting a need for consistency in real-world clinical applications. CONCLUSIONS: While AI tools demonstrated potential use in educational contexts, their integration into clinical decision-making requires caution due to inconsistent responses and deviations from peer consensus. Future research should focus on specialised clinical AI tool development to maximise utility in clinical decision-making. LEVEL OF EVIDENCE: IV.",Agharia S; Szatkowski J; Fraval A; Stevens J; Zhou Y 40209205,Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.,2025,JMIR medical education,,,,"BACKGROUND: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have started a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies indicated that at the current level of development, LLMs can pass different board exams. However, the ability to answer specific subject-related questions requires validation. OBJECTIVE: The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots-Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft)-against the academic results of medical students in the medical biochemistry course. METHODS: We used 200 USMLE (United States Medical Licensing Examination)-style multiple-choice questions (MCQs) selected from the course exam database. They encompassed various complexity levels and were distributed across 23 distinctive topics. The questions with tables and images were not included in the study. The results of 5 successive attempts by Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot to answer this questionnaire set were evaluated based on accuracy in August 2024. Statistica 13.5.0.17 (TIBCO Software Inc) was used to analyze the data's basic statistics. 
Considering the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with a statistical significance level of P<.05. RESULTS: On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students' performance by 8.3% (P=.02). In this study, Claude showed the best performance in biochemistry MCQs, correctly answering 92.5% (185/200) of questions, followed by GPT-4 (170/200, 85%), Gemini (157/200, 78.5%), and Copilot (128/200, 64%). The chatbots demonstrated the best results in the following 4 topics: eicosanoids (mean 100%, SD 0%), bioenergetics and electron transport chain (mean 96.4%, SD 7.2%), hexose monophosphate pathway (mean 91.7%, SD 16.7%), and ketone bodies (mean 93.8%, SD 12.5%). The Pearson chi-square test indicated a statistically significant association between the answers of all 4 chatbots (P<.001 to P<.04). CONCLUSIONS: Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. This performance highlights the potential of AI in medical education and assessment.",Bolgova O; Shypilova I; Mavrych V 40001164,Quality assurance and validity of AI-generated single best answer questions.,2025,BMC medical education,,,,"BACKGROUND: Recent advancements in generative artificial intelligence (AI) have opened new avenues in educational methodologies, particularly in medical education. This study seeks to assess whether generative AI might be useful in addressing the depletion of assessment question banks, a challenge intensified during the Covid-era due to the prevalence of open-book examinations, and to augment the pool of formative assessment opportunities available to students. 
While many recent publications have sought to ascertain whether AI can achieve a passing standard in existing examinations, this study investigates the potential for AI to generate the exam itself. This research utilized a commercially available AI large language model (LLM), OpenAI GPT-4, to generate 220 single best answer (SBA) questions, adhering to Medical Schools Council Assessment Alliance guidelines and a selection of Learning Outcomes (LOs) of the Scottish Graduate-Entry Medicine (ScotGEM) program. All questions were assessed by an expert panel for accuracy and quality. A total of 50 AI-generated and 50 human-authored questions were used to create two 50-item formative SBA examinations for Year 1 and Year 2 ScotGEM students. Each exam, delivered via the Speedwell eSystem, comprised 25 AI-generated and 25 human-authored questions presented in random order. Students completed the online, closed-book exams on personal devices under exam conditions that reflected summative examinations. The performance of both AI-generated and human-authored questions was evaluated, focusing on facility and discrimination index as key metrics. The screening process revealed that 69% of AI-generated SBAs were fit for inclusion in the examinations with little or no modification required. Modifications, when necessary, were predominantly due to reasons such as the inclusion of ""all of the above"" options, usage of American English spellings, and non-alphabetized answer choices. The remaining 31% of questions were rejected for inclusion in the examinations due to factual inaccuracies and non-alignment with students' learning. When included in an examination, post hoc statistical analysis indicated no significant difference in performance between the AI-generated and human-authored questions in terms of facility and discrimination index. 
DISCUSSION AND CONCLUSION: The outcomes of this study suggest that AI LLMs can generate SBA questions that are in line with best-practice guidelines and specific LOs. However, a robust quality assurance process is necessary to ensure that erroneous questions are identified and rejected. The insights gained from this research provide a foundation for further investigation into refining AI prompts, aiming for a more reliable generation of curriculum-aligned questions. LLMs show significant potential in supplementing traditional methods of question generation in medical education. This approach offers a viable solution to rapidly replenish and diversify assessment resources in medical curricula, marking a step forward in the intersection of AI and education.",Ahmed A; Kerr E; O'Malley A 39730155,ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study.,2024,BMJ open,,,,"BACKGROUND: Recent breakthroughs in artificial intelligence research include the development of generative pretrained transformers (GPT). ChatGPT has been shown to perform well when answering several sets of medical multiple-choice questions. However, it has not been tested for writing free-text assessments of complex cases in primary care. OBJECTIVES: To compare the performance of ChatGPT, version GPT-4, with that of real doctors. DESIGN AND SETTING: A blinded observational comparative study conducted in the Swedish primary care setting. Responses from GPT-4 and real doctors to cases from the Swedish family medicine specialist examination were scored by blinded reviewers, and the scores were compared. PARTICIPANTS: Anonymous responses from the Swedish family medicine specialist examination 2017-2022 were used. OUTCOME MEASURES: Primary: the mean difference in scores between GPT-4's responses and randomly selected responses by human doctors, as well as between GPT-4's responses and top-tier responses by human doctors. 
Secondary: the correlation between differences in response length and response score; the intraclass correlation coefficient between reviewers; and the percentage of maximum score achieved by each group in different subject categories. RESULTS: The mean scores were 6.0, 7.2 and 4.5 for randomly selected doctor responses, top-tier doctor responses and GPT-4 responses, respectively, on a 10-point scale. The scores for the random doctor responses were, on average, 1.6 points higher than those of GPT-4 (p<0.001, 95% CI 0.9 to 2.2) and the top-tier doctor scores were, on average, 2.7 points higher than those of GPT-4 (p<0.001, 95% CI 2.2 to 3.3). Following the release of GPT-4o, the experiment was repeated, although this time with only a single reviewer scoring the answers. In this follow-up, random doctor responses were scored 0.7 points higher than those of GPT-4o (p=0.044). CONCLUSION: In complex primary care cases, GPT-4 performs worse than human doctors taking the family medicine specialist examination. Future GPT-based chatbots may perform better, but comprehensive evaluations are needed before implementing chatbots for medical decision support in primary care.",Arvidsson R; Gunnarsson R; Entezarjou A; Sundemo D; Wikberg C 39504445,ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis.,2024,JMIR medical education,,,,"BACKGROUND: Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education. 
OBJECTIVE: This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. METHODS: This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models' performances. RESULTS: GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o's highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3-60.3). CONCLUSIONS: GPT-4o's performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. 
These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.",Bicknell BT; Butler D; Whalen S; Ricks J; Dixon CJ; Clark AB; Spaedy O; Skelton A; Edupuganti N; Dzubinski L; Tate H; Dyess G; Lindeman B; Lehmann LS 39496149,Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study.,2024,Journal of medical Internet research,,,,"BACKGROUND: Recent surveys indicate that 48% of consumers actively use generative artificial intelligence (AI) for health-related inquiries. Despite widespread adoption and the potential to improve health care access, scant research examines the performance of AI chatbot responses regarding emergency care advice. OBJECTIVE: We assessed the quality of AI chatbot responses to common emergency care questions. We sought to determine qualitative differences in responses from 4 free-access AI chatbots, for 10 different serious and benign emergency conditions. METHODS: We created 10 emergency care questions that we fed into the free-access versions of ChatGPT 3.5 (OpenAI), Google Bard, Bing AI Chat (Microsoft), and Claude AI (Anthropic) on November 26, 2023. Each response was graded by 5 board-certified emergency medicine (EM) faculty for 8 domains of percentage accuracy, presence of dangerous information, factual accuracy, clarity, completeness, understandability, source reliability, and source relevancy. We determined the correct, complete response to the 10 questions from reputable and scholarly emergency medical references. These were compiled by an EM resident physician. For the readability of the chatbot responses, we used the Flesch-Kincaid Grade Level of each response from readability statistics embedded in Microsoft Word. 
Differences between chatbots were determined by the chi-square test. RESULTS: Each of the 4 chatbots' responses to the 10 clinical questions was scored across 8 domains by 5 EM faculty, yielding 400 assessments per chatbot. Together, the 4 chatbots had the best performance in clarity and understandability (both 85%), intermediate performance in accuracy and completeness (both 50%), and poor performance (10%) for source relevance and reliability (mostly unreported). Chatbots contained dangerous information in 5% to 35% of responses, with no statistical difference between chatbots on this metric (P=.24). ChatGPT, Google Bard, and Claude AI had similar performances across 6 out of 8 domains. Only Bing AI performed better, with more identified or relevant sources (40%; the others had 0%-10%). The Flesch-Kincaid reading level was grade 7.7-8.9 for all chatbots except ChatGPT, at grade 10.8, all of which were too advanced for average emergency patients. Responses included both dangerous (eg, starting cardiopulmonary resuscitation with no pulse check) and generally inappropriate advice (eg, loosening the collar to improve breathing without evidence of airway compromise). CONCLUSIONS: AI chatbots, though ubiquitous, have significant deficiencies in EM patient advice, despite relatively consistent performance. Information for when to seek urgent or emergent care is frequently incomplete and inaccurate, and patients may be unaware of misinformation. Sources are not generally provided. Patients who use AI to guide health care decisions assume potential risks. AI chatbots for health should be subject to further research, refinement, and regulation. 
We strongly recommend proper medical consultation to prevent potential adverse outcomes.",Yau JY; Saadat S; Hsu E; Murphy LS; Roh JS; Suchard J; Tapia A; Wiechmann W; Langdorf MI 39382347,Registered Nurses' Attitudes Towards ChatGPT and Self-Directed Learning: A Cross-Sectional Study.,2024,Journal of advanced nursing,,,,"BACKGROUND: Self-directed, lifelong learning is essential for nurses' competence in complex healthcare environments, which are characterised by rapid advancements in medicine and technology and nursing shortages. Previous studies have demonstrated that ChatGPT technology fosters self-directed learning by motivating users to engage with it. OBJECTIVES: To explore the relationships amongst socio-demographic data, attitudes towards ChatGPT use, and self-directed learning amongst registered nurses in Taiwan. METHODS: A cross-sectional study design with an online survey was adopted. Registered nurses from various healthcare settings were recruited through Facebook and LINE, a widely used messaging application in East Asia, reaching over 1000 nurses across five distinct online groups. An online survey was used to collect data, including socio-demographic characteristics, attitudes towards ChatGPT use, and a self-directed learning scale. Data were analysed using descriptive statistical methods, t-tests, Pearson's correlation, one-way analysis of variance, and multiple linear regression analysis. RESULTS: Amongst the 330 participants, 50.6% worked in hospitals, 51.8% had more than 15 years of work experience, and 78.2% did not hold supervisory positions. Of the participants, 46.7% had used ChatGPT. For all nurses, work experience and awareness of ChatGPT statistically significantly predicted self-directed learning, explaining 32.0% of the variance. For those familiar with ChatGPT, work experience in nursing and the technological/social influence of ChatGPT statistically significantly predicted self-directed learning, explaining 35.3% of the variance. 
CONCLUSIONS: Work experience in nursing provides critical opportunities for professional development and training. Therefore, ChatGPT-supported self-directed learning should be customised for degrees of experience to optimise continuous education. IMPLICATIONS FOR NURSING MANAGEMENT AND HEALTH POLICY: This study explores nurses' diverse use of and attitudes towards ChatGPT for self-directed learning. It suggests that administrators customise support and training when incorporating ChatGPT into professional development, accounting for nurses' varied experiences to enhance learning outcomes. PATIENT OR PUBLIC CONTRIBUTION: No patient or public contribution. REPORTING METHOD: This study adhered to the relevant cross-sectional STROBE guidelines.",Chang LC; Wang YN; Lin HL; Liao LL 40229613,Man Versus Machine: A Comparative Study of Human and ChatGPT-Generated Abstracts in Plastic Surgery Research.,2025,Aesthetic plastic surgery,,,,"BACKGROUND: Since its 2022 release, ChatGPT has gained recognition for its potential to expedite time-consuming writing tasks like scientific writing. Well-written scientific abstracts are essential for clear and efficient communication of research findings. This study aims to explore ChatGPT-4's capability to produce well-crafted abstracts. METHODS: Ten abstract-less plastic surgery articles from PubMed were uploaded to ChatGPT, each with a prompt to generate one abstract. Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES) were calculated for all abstracts. Additionally, three physician evaluators blindly assessed the ten original and ten ChatGPT-generated abstracts using a 5-point Likert scale. Results were compared and analyzed using descriptive statistics with mean and standard deviation (SD). RESULTS: The original abstracts averaged an FKGL of 14.1 (SD 2.9) and an FRES of 25.2 (SD 14.2), while ChatGPT-generated abstracts had scores of 15.6 (SD 2.4) and 15.4 (SD 13.1), respectively. 
Collectively, evaluators correctly identified two-thirds of the ChatGPT-generated abstracts, but preferred the ChatGPT abstracts 90% of the time. On average, the evaluators found the ChatGPT abstracts to be more ""well written"" (4.23 vs. 3.50, p value < 0.001) and ""clear and concise"" (4.30 vs. 3.53, p value < 0.001) compared to the original abstracts. CONCLUSIONS: Despite a slightly higher reading level, evaluators generally preferred ChatGPT abstracts, which received higher ratings overall. These findings suggest ChatGPT holds promise in expediting the creation of high-quality scientific abstracts, potentially enhancing efficiency in research and scientific writing tasks. However, due to its exploratory nature, this study calls for additional research to validate these promising findings. LEVEL OF EVIDENCE IV: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.",Pressman SM; Garcia JP; Borna S; Gomez-Cabello CA; Haider SA; Haider CR; Forte AJ 37561097,Examining Real-World Medication Consultations and Drug-Herb Interactions: ChatGPT Performance Evaluation.,2023,JMIR medical education,,,,"BACKGROUND: Since OpenAI released ChatGPT, with its strong capability in handling natural language tasks and its user-friendly interface, it has garnered significant attention. OBJECTIVE: A prospective analysis is required to evaluate the accuracy and appropriateness of medication consultation responses generated by ChatGPT. METHODS: A prospective cross-sectional study was conducted by the pharmacy department of a medical center in Taiwan. The test data set comprised retrospective medication consultation questions collected from February 1, 2023, to February 28, 2023, along with common questions about drug-herb interactions. 
Two distinct sets of questions were tested: real-world medication consultation questions and common questions about interactions between traditional Chinese and Western medicines. We used the conventional double-review mechanism. The appropriateness of each response from ChatGPT was assessed by 2 experienced pharmacists. In the event of a discrepancy between the assessments, a third pharmacist stepped in to make the final decision. RESULTS: Of 293 real-world medication consultation questions, a random selection of 80 was used to evaluate ChatGPT's performance. ChatGPT exhibited a higher appropriateness rate in responding to public medication consultation questions compared to those asked by health care providers in a hospital setting (31/51, 61% vs 20/51, 39%; P=.01). CONCLUSIONS: The findings from this study suggest that ChatGPT could potentially be used for answering basic medication consultation questions. Our analysis of the erroneous information allowed us to identify potential medical risks associated with certain questions; this problem deserves our close attention.",Hsu HY; Hsu KC; Hou SY; Wu CL; Hsieh YW; Cheng YD 37456381,Snakebite Advice and Counseling From Artificial Intelligence: An Acute Venomous Snakebite Consultation With ChatGPT.,2023,Cureus,,,,"BACKGROUND: Snakebites, particularly from venomous species, present a significant global public health challenge. Access to accurate and timely information regarding snakebite prevention, recognition, and management is crucial for minimizing morbidity and mortality. Artificial intelligence (AI) language models, such as ChatGPT (Chat Generative Pre-trained Transformer), have the potential to revolutionize the dissemination of medical information and improve patient education and satisfaction. METHODS: This study aimed to explore the utility of ChatGPT, an advanced language model, in simulating acute venomous snakebite consultations. 
Nine hypothetical questions based on comprehensive snakebite management guidelines were posed to ChatGPT, and the responses were evaluated by clinical toxicologists and emergency medicine physicians. RESULTS: ChatGPT provided accurate and informative responses related to the immediate management of snakebites, the urgency of seeking medical attention, symptoms, and health issues following venomous snakebites, the role of antivenom, misconceptions about snakebites, recovery, pain management, and prevention strategies. The model highlighted the importance of seeking professional medical care and adhering to healthcare practitioners' advice. However, some limitations were identified, including outdated knowledge, lack of personalization, and inability to consider regional variations and individual characteristics. CONCLUSION: ChatGPT demonstrated proficiency in generating intelligible and well-informed responses related to venomous snakebites. It offers accessible and real-time advice, making it a valuable resource for preliminary information, education, and triage support in remote or underserved areas. While acknowledging its limitations, such as the need for up-to-date information and personalized advice, ChatGPT can serve as a supplementary source of information to complement professional medical consultation and enhance patient education. Future research should focus on addressing the identified limitations and establishing region-specific guidelines for snakebite management.",Altamimi I; Altamimi A; Alhumimidi AS; Altamimi A; Temsah MH 40393017,Assessing ChatGPT's Capability as a New Age Standardized Patient: Qualitative Study.,2025,JMIR medical education,,,,"BACKGROUND: Standardized patients (SPs) have been crucial in medical education, offering realistic patient interactions to students. Despite their benefits, SP training is resource-intensive and access can be limited. 
Advances in artificial intelligence (AI), particularly with large language models such as ChatGPT, present new opportunities for virtual SPs, potentially addressing these limitations. OBJECTIVES: This study aims to assess medical students' perceptions and experiences of using ChatGPT as an SP and to evaluate ChatGPT's effectiveness in performing as a virtual SP in a medical school setting. METHODS: This qualitative study, approved by the American University of Antigua Institutional Review Board, involved 9 students (5 females and 4 males, aged 22-48 years) from the American University of Antigua College of Medicine. Students were observed during a live role-play, interacting with ChatGPT as an SP using a predetermined prompt. A structured 15-question survey was administered before and after the interaction. Thematic analysis was conducted on the transcribed and coded responses, with inductive category formation. RESULTS: Thematic analysis identified key themes preinteraction including technology limitations (eg, prompt engineering difficulties), learning efficacy (eg, potential for personalized learning and reduced interview stress), verisimilitude (eg, absence of visual cues), and trust (eg, concerns about AI accuracy). Postinteraction, students noted improvements in prompt engineering, some alignment issues (eg, limited responses on sensitive topics), maintained learning efficacy (eg, convenience and repetition), and continued verisimilitude challenges (eg, lack of empathy and nonverbal cues). No significant trust issues were reported postinteraction. Despite some limitations, students found ChatGPT as a valuable supplement to traditional SPs, enhancing practice flexibility and diagnostic skills. CONCLUSIONS: ChatGPT can effectively augment traditional SPs in medical education, offering accessible, flexible practice opportunities. However, it cannot fully replace human SPs due to limitations in verisimilitude and prompt engineering challenges. 
Integrating prompt engineering into medical curricula and continuous advancements in AI are recommended to enhance the use of virtual SPs.",Cross J; Kayalackakom T; Robinson RE; Vaughans A; Sebastian R; Hood R; Lewis C; Devaraju S; Honnavar P; Naik S; Joseph J; Anand N; Mohammed A; Johnson A; Cohen E; Adeniji T; Nnenna Nnaji A; George JE 37371718,"PRISMA Systematic Literature Review, including with Meta-Analysis vs. Chatbot/GPT (AI) regarding Current Scientific Data on the Main Effects of the Calf Blood Deproteinized Hemoderivative Medicine (Actovegin) in Ischemic Stroke.",2023,Biomedicines,,,,"BACKGROUND: Stroke is a significant public health problem and a leading cause of death and long-term disability worldwide. Several treatments for ischemic stroke have been developed, but these treatments have limited effectiveness. One potential treatment for this condition is Actovegin(R)/AODEJIN, a calf blood deproteinized hemodialysate/ultrafiltrate that has been shown to have pleiotropic/multifactorial and possibly multimodal effects. The actual actions of this medicine are thought to be mediated by its ability to reduce oxidative stress, inflammation, and apoptosis and to enhance neuronal survival and plasticity. METHODS: To obtain the most up-to-date information on the effects of Actovegin(R)/AODEJIN in ischemic stroke, we systematically reviewed the literature published in the last two years. This review builds upon our previous systematic literature review published in 2020, which used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method to search for and select related articles over almost two decades, between 1 January 2001 and 31 December 2019. Additionally, we compared the results of our PRISMA search (human intelligence-based) with those obtained from an interrogation of a GPT-based chatbot (ChatGPT) in order to ensure comprehensive coverage of potentially relevant studies. 
RESULTS: Our updated review found limited new evidence on the use of Actovegin(R)/AODEJIN in ischemic stroke, although the number of articles on this subject consistently increased compared to that from our initial systematic literature review. Specifically, we found five articles up to 2020 and eight more until December 2022. While these studies suggest that Actovegin(R)/AODEJIN may have neuroprotective effects in ischemic stroke, further clinical trials are needed to confirm these findings. Consequently, we performed a funnel analysis to evaluate the potential for publication bias. DISCUSSION: Our funnel analysis showed no evidence of publication bias, suggesting that the limited number of studies identified was not due to publication bias but rather due to a lack of research in this area. However, there are limitations when using ChatGPT, particularly in distinguishing between truth and falsehood and determining the appropriateness of interpolation. Nevertheless, AI can provide valuable support in conducting PRISMA-type systematic literature reviews, including meta-analyses. CONCLUSIONS: The limited number of studies identified in our review highlights the need for additional research in this area, especially as no available therapeutic agents are capable of curing central nervous system lesions. Any contribution, including that of Actovegin (with consideration of a positive balance between benefits and risks), is worthy of further study and periodic reappraisal. The evolving advancements in AI may play a role in the near future.",Anghelescu A; Firan FC; Onose G; Munteanu C; Trandafir AI; Ciobanu I; Gheorghita S; Ciobanu V 38449683,Exploring the proficiency of ChatGPT-4: An evaluation of its performance in the Taiwan advanced medical licensing examination.,2024,Digital health,,,,"BACKGROUND: Taiwan is well-known for its quality healthcare system. The country's medical licensing exams offer a way to evaluate ChatGPT's medical proficiency. 
METHODS: We analyzed exam data from February 2022, July 2022, February 2023, and July 2023. Each exam included four papers with 80 single-choice questions, grouped as descriptive or picture-based. We used ChatGPT-4 for evaluation. Incorrect answers prompted a ""chain of thought"" approach. Accuracy rates were calculated as percentages. RESULTS: ChatGPT-4's accuracy in medical exams ranged from 63.75% to 93.75% (February 2022-July 2023). The highest accuracy (93.75%) was in February 2022's Medicine Exam (3). Subjects with the highest misanswered rates were ophthalmology (28.95%), breast surgery (27.27%), plastic surgery (26.67%), orthopedics (25.00%), and general surgery (24.59%). While using ""chain of thought,"" the ""Accuracy of (CoT) prompting"" ranged from 0.00% to 88.89%, and the final overall accuracy rate ranged from 90% to 98%. CONCLUSION: ChatGPT-4 succeeded in Taiwan's medical licensing exams. With the ""chain of thought"" prompt, it improved accuracy to over 90%.",Lin SY; Chan PK; Hsu WH; Kao CH 39118469,Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study.,2024,JMIR medical education,,,,"BACKGROUND: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. OBJECTIVE: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes. METHODS: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction.
The students' notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. RESULTS: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). CONCLUSIONS: ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.",Burke HB; Hoang A; Lopreiato JO; King H; Hemmer P; Montgomery M; Gagarin V 39388702,Navigating Nephrology's Decline Through a GPT-4 Analysis of Internal Medicine Specialties in the United States: Qualitative Study.,2024,JMIR medical education,,,,"BACKGROUND: The 2024 Nephrology fellowship match data show the declining interest in nephrology in the United States, with an 11% drop in candidates and a mere 66% (321/488) of positions filled. OBJECTIVE: The study aims to discern the factors influencing this trend using ChatGPT, a leading chatbot model, for insights into the comparative appeal of nephrology versus other internal medicine specialties.
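As a sanity check on the note-scoring study (Burke et al.) above, the ""86% lower"" figure is simply the relative reduction from the standardized patients' 7.2% incorrect-scoring rate to ChatGPT's 1.0% rate. A minimal Python sketch (variable names are ours, not the authors'):

```python
# Relative error reduction: how much lower ChatGPT's incorrect-scoring
# rate (1.0%) is than the standardized patients' rate (7.2%).
chatgpt_rate = 0.010
sp_rate = 0.072
reduction = (sp_rate - chatgpt_rate) / sp_rate
print(f"{reduction:.0%}")  # prints "86%"
```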
METHODS: Using the GPT-4 model, the study compared nephrology with 13 other internal medicine specialties, evaluating each on 7 criteria including intellectual complexity, work-life balance, procedural involvement, research opportunities, patient relationships, career demand, and financial compensation. Each criterion was assigned scores from 1 to 10, with the cumulative score determining the ranking. The approach included counteracting potential bias by instructing GPT-4 to favor other specialties over nephrology in reverse scenarios. RESULTS: GPT-4 ranked nephrology only above sleep medicine. While nephrology scored higher than hospice and palliative medicine, it fell short in key criteria such as work-life balance, patient relationships, and career demand. When examining the percentage of filled positions in the 2024 appointment year match, nephrology's filled rate was 66%, only higher than the 45% (155/348) filled rate of geriatric medicine. Nephrology's score decreased by 4%-14% in 5 criteria including intellectual challenge and complexity, procedural involvement, career opportunity and demand, research and academic opportunities, and financial compensation. CONCLUSIONS: ChatGPT does not favor nephrology over most internal medicine specialties, highlighting its diminishing appeal as a career choice. This trend raises significant concerns, especially considering the overall physician shortage, and prompts a reevaluation of factors affecting specialty choice among medical residents.",Miao J; Thongprayoon C; Garcia Valencia O; Craici IM; Cheungpasitporn W 39050145,Assessing accuracy of ChatGPT in response to questions from day to day pharmaceutical care in hospitals.,2024,Exploratory research in clinical and social pharmacy,,,,"BACKGROUND: The advent of Large Language Models (LLMs) such as ChatGPT introduces opportunities within the medical field. 
Nonetheless, use of LLMs poses a risk when healthcare practitioners and patients present clinical questions to these programs without a comprehensive understanding of their suitability for clinical contexts. OBJECTIVE: The objective of this study was to assess ChatGPT's ability to generate appropriate responses to clinical questions that hospital pharmacists could encounter during routine patient care. METHODS: Thirty questions from 10 different domains within clinical pharmacy were collected during routine care. Questions were presented to ChatGPT in a standardized format, including patients' age, sex, drug name, dose, and indication. Subsequently, relevant information regarding specific cases was provided, and the prompt was concluded with the query ""what would a hospital pharmacist do?"". The impact on accuracy was assessed for each domain by modifying personification to ""what would you do?"", presenting the question in Dutch, and regenerating the primary question. All responses were independently evaluated by two senior hospital pharmacists, focusing on the availability of advice, accuracy, and concordance. RESULTS: In 77% of questions, ChatGPT provided advice in response to the question. For these responses, accuracy and concordance were determined. Accuracy was correct and complete for 26% of responses, correct but incomplete for 22% of responses, partially correct and partially incorrect for 30% of responses and completely incorrect for 22% of responses. The reproducibility was poor, with merely 10% of responses remaining consistent upon regeneration of the primary question. CONCLUSIONS: While concordance of responses was excellent, the accuracy and reproducibility were poor. With the described method, ChatGPT should not be used to address questions encountered by hospital pharmacists during their shifts.
However, it is important to acknowledge the limitations of our methodology, including potential biases, which may have influenced the findings.",van Nuland M; Lobbezoo AH; van de Garde EMW; Herbrink M; van Heijl I; Bognar T; Houwen JPA; Dekens M; Wannet D; Egberts T; van der Linden PD 39430693,Assessing AI efficacy in medical knowledge tests: A study using Taiwan's internal medicine exam questions from 2020 to 2023.,2024,Digital health,,,,"BACKGROUND: The aim of this study is to evaluate the ability of generative artificial intelligence (AI) models to handle specialized medical knowledge and problem-solving in a formal examination context. METHODS: This research utilized internal medicine exam questions provided by the Taiwan Internal Medicine Society from 2020 to 2023, testing three AI models: GPT-4o, Claude_3.5 Sonnet, and Gemini Advanced models. Rejected queries for Gemini Advanced were translated into French for resubmission. Performance was assessed using IBM SPSS Statistics 26, with accuracy percentages calculated and statistical analyses such as Pearson correlation and analysis of variance (ANOVA) performed to gauge AI efficacy. RESULTS: GPT-4o's top annual score was 86.25 in 2022, with an average of 81.97. Claude_3.5 Sonnet reached a peak score of 88.13 in 2021 and 2022, averaging 84.85, while Gemini Advanced lagged with an average score of 69.84. In specific specialties, Claude_3.5 Sonnet scored highest in Psychiatry (100%) and Nephrology (97.26%), with GPT-4o performing similarly well in Hematology & oncology (97.10%) and Nephrology (94.52%). Gemini's best scores were in Psychiatry (86.96%) and Hematology & Oncology (82.76%). Gemini Advanced models struggled with Neurology, scoring below 60%. Additionally, all models performed better on text-based questions than on image-based ones, without significant differences. Claude 3 Opus scored highest on COVID-19-related questions at 89.29%, followed by GPT-4o at 75.00% and Gemini Advanced at 67.86%. 
CONCLUSIONS: AI models showed varied proficiency across medical specialties and question types. GPT-4o demonstrated higher image-based correction rates. Claude_3.5 Sonnet generally and consistently outperformed others, highlighting significant potential for AI in assisting medical education.",Lin SY; Hsu YY; Ju SW; Yeh PC; Hsu WH; Kao CH 38797622,ChatGPT performance on radiation technologist and therapist entry to practice exams.,2024,Journal of medical imaging and radiation sciences,,,,"BACKGROUND: The aim of this study was to describe the proficiency of ChatGPT (GPT-4) on certification style exams from the Canadian Association of Medical Radiation Technologists (CAMRT), and describe its performance across multiple exam attempts. METHODS: ChatGPT was prompted with questions from CAMRT practice exams in the disciplines of radiological technology, magnetic resonance (MRI), nuclear medicine and radiation therapy (87-98 questions each). ChatGPT attempted each exam five times. Exam performance was evaluated using descriptive statistics, stratified by discipline and question type (knowledge, application, critical thinking). Light's Kappa was used to assess agreement in answers across attempts. RESULTS: Using a passing grade of 65 %, ChatGPT passed the radiological technology exam only once (20 %), MRI all five times (100 %), nuclear medicine three times (60 %), and radiation therapy all five times (100 %). ChatGPT's performance was best on knowledge questions across all disciplines except radiation therapy. It performed worst on critical thinking questions. Agreement in ChatGPT's responses across attempts was substantial within the disciplines of radiological technology, MRI, and nuclear medicine, and almost perfect for radiation therapy. CONCLUSION: ChatGPT (GPT-4) was able to pass certification style exams for radiation technologists and therapists, but its performance varied between disciplines. 
The algorithm demonstrated substantial to almost perfect agreement in the responses it provided across multiple exam attempts. Future research evaluating ChatGPT's performance on standardized tests should consider using repeated measures.",Duggan R; Tsuruda KM 37549788,Can ChatGPT pass the thoracic surgery exam?,2023,The American journal of the medical sciences,,,,"BACKGROUND: The capacity of ChatGPT in academic environments and medical exams is being discovered more and more every day. In this study, we tested the success of ChatGPT on Turkish-language thoracic surgery exam questions. METHODS: ChatGPT was provided with a total of 105 questions divided into seven distinct groups, each of which contained 15 questions. Along with the success of the students, the success of ChatGPT-3.5 and ChatGPT-4 architectures in answering the questions correctly was analyzed. RESULTS: The overall mean score of students was 12.50 +/- 1.20, corresponding to 83.33%. Moreover, ChatGPT-3.5 managed to surpass students' score of 12.5 with an average of 13.57 +/- 0.49 questions correctly on average, while ChatGPT-4 answered 14 +/- 0.76 questions correctly (83.3%, 90.48%, and 93.33%, respectively). CONCLUSIONS: When the results of this study and other similar studies in the literature are evaluated together, ChatGPT, which was developed for general purpose, can also produce successful results in a specific field of medicine. AI-powered applications are becoming more and more useful and valuable in providing academic knowledge.",Gencer A; Aydin S 40211256,Evaluating the agreement between ChatGPT-4 and validated questionnaires in screening for anxiety and depression in college students: a cross-sectional study.,2025,BMC psychiatry,,,,"BACKGROUND: The Chat Generative Pre-trained Transformer (ChatGPT), an artificial intelligence-based web application, has demonstrated substantial potential across various knowledge domains, particularly in medicine. 
This cross-sectional study assessed the validity and possible usefulness of ChatGPT-4 in assessing anxiety and depression by comparing two questionnaires. METHODS: This study tasked ChatGPT-4 with generating a structured interview questionnaire based on the validated Patient Health Questionnaire-9 (PHQ-9) and Generalized Anxiety Disorder Scale-7 (GAD-7). These new measures were referred to as GPT-PHQ-9 and GPT-GAD-7. This study utilized Spearman correlation analysis, intra-class correlation coefficients (ICC), Youden's index, receiver operating characteristic (ROC) curves, and Bland-Altman plots to evaluate the consistency between scores from a ChatGPT-4 adapted questionnaire and those from a validated questionnaire. RESULTS: A total of 200 college students participated. Cronbach's alpha indicated acceptable reliability for both GPT-PHQ-9 (alpha = 0.75) and GPT-GAD-7 (alpha = 0.76). ICC values were 0.80 for PHQ-9 and 0.70 for GAD-7. Spearman's correlation showed moderate associations with PHQ-9 (rho = 0.63) and GAD-7 (rho = 0.68). ROC curve analysis revealed optimal cutoffs of 9.5 for depressive symptoms and 6.5 for anxiety symptoms, both with high sensitivity and specificity. CONCLUSIONS: The questionnaire adapted by ChatGPT-4 demonstrated good consistency with the validated questionnaire. Future studies should investigate the usefulness of the ChatGPT-designed questionnaire in different populations.",Liu J; Gu J; Tong M; Yue Y; Qiu Y; Zeng L; Yu Y; Yang F; Zhao S 40393915,Evaluating the Agreement Between ChatGPT-4 and Validated Mental Health Scales in Older Adults: A Cross-Sectional Study.,2025,The American journal of geriatric psychiatry : official journal of the American Association for Geriatric Psychiatry,,,,"BACKGROUND: The chat generative pretrained transformer (ChatGPT), an artificial intelligence-based web application, has demonstrated substantial potential across various knowledge domains, particularly in medicine.
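The ROC cutoff selection reported in the anxiety/depression study above (optimal cutoffs of 9.5 and 6.5) is typically done by maximizing Youden's J = sensitivity + specificity - 1 across candidate thresholds. A minimal pure-Python sketch, assuming integer questionnaire totals and binary case labels (illustrative only, not the authors' analysis code):

```python
def youden_cutoff(scores, labels):
    """Pick the cutoff maximizing Youden's J = sensitivity + specificity - 1.

    `scores` are questionnaire totals, `labels` are 1 (case) / 0 (non-case).
    Candidate cutoffs are midpoints between adjacent distinct scores, which
    is how half-point thresholds such as 9.5 arise from integer totals.
    """
    distinct = sorted(set(scores))
    cutoffs = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for c in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < c and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < c and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 0)
        sens = tp / (tp + fn)  # true-positive rate at this cutoff
        spec = tn / (tn + fp)  # true-negative rate at this cutoff
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, c)
    return best[1], best[0]
```

Scoring at midpoints between adjacent totals is one common convention; it explains why published cutoffs on integer-scored scales are often reported as x.5 values.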
This cross-sectional study assessed the validity and possible usefulness of ChatGPT-4 in assessing mental health by comparing several scales. METHODS: A cross-sectional study recruited 127 older adults (>/=60 years old) from December 2023 to October 2024 in Wuhan. ChatGPT-4 was used to adapt six scales (PHQ-9, GAD-7, K10, PSS-4, ULS-6, WHO-5) into scenarios contextualized to daily life. The level of agreement between the ChatGPT-4 adapted scales and the traditional scales was compared using Spearman correlation, Cronbach's alpha coefficient, intraclass correlation coefficient (ICC), and Bland-Altman analysis. RESULTS: The ChatGPT-adapted questionnaires showed moderate to strong correlations with traditionally validated measures of anxiety, depression, psychological distress, perceived stress, loneliness, and well-being. Significant positive correlations were observed for total scores, including PHQ-9 (rho = 0.61), GAD-7 (rho = 0.66), K10 (rho = 0.75), PSS-4 (rho = 0.71), and ULS-6 (rho = 0.73), with slightly weaker correlation for WHO-5 (rho = 0.41). Reliability analyses yielded Cronbach's alpha values ranging from 0.66 to 0.91 and ICCs ranging from 0.47 to 0.81, confirming strong internal consistency and test-retest reliability. CONCLUSIONS: Moderate to high correlations were found between the adapted ChatGPT-4 questionnaire and the traditional scale, indicating that it shows promise as a supplemental mental health assessment tool.
However, further research is needed to explore its broader applicability.",Liu J; Gu J; Tong M; Yue Y; Qiu Y; Zeng L; Yu Y; Yang F 39139744,Understanding How ChatGPT May Become a Clinical Administrative Tool Through an Investigation on the Ability to Answer Common Patient Questions Concerning Ulnar Collateral Ligament Injuries.,2024,Orthopaedic journal of sports medicine,,,,"BACKGROUND: The consumer availability and automated response functions of chat generator pretrained transformer (ChatGPT-4), a large language model, poise this application to be utilized for patient health queries and may have a role in serving as an adjunct to minimize administrative and clinical burden. PURPOSE: To evaluate the ability of ChatGPT-4 to respond to patient inquiries concerning ulnar collateral ligament (UCL) injuries and compare these results with the performance of Google. STUDY DESIGN: Cross-sectional study. METHODS: Google Web Search was used as a benchmark, as it is the most widely used search engine worldwide and the only search engine that generates frequently asked questions (FAQs) when prompted with a query, allowing comparisons through a systematic approach. The query ""ulnar collateral ligament reconstruction"" was entered into Google, and the top 10 FAQs, answers, and their sources were recorded. ChatGPT-4 was prompted to perform a Google search of FAQs with the same query and to record the sources of answers for comparison. This process was again replicated to obtain 10 new questions requiring numeric instead of open-ended responses. Finally, responses were graded independently for clinical accuracy (grade 0 = inaccurate, grade 1 = somewhat accurate, grade 2 = accurate) by 2 fellowship-trained sports medicine surgeons (D.W.A, J.S.D.) blinded to the search engine and answer source. RESULTS: ChatGPT-4 used a greater proportion of academic sources than Google to provide answers to the top 10 FAQs, although this was not statistically significant (90% vs 50%; P = .14). 
In terms of question overlap, 40% of the most common questions on Google and ChatGPT-4 were the same. When comparing FAQs with numeric responses, 20% of answers were completely overlapping, 30% demonstrated partial overlap, and the remaining 50% did not demonstrate any overlap. All sources used by ChatGPT-4 to answer these FAQs were academic, while only 20% of sources used by Google were academic (P = .0007). The remaining Google sources included social media (40%), medical practices (20%), single-surgeon websites (10%), and commercial websites (10%). The mean (+/- standard deviation) accuracy for answers given by ChatGPT-4 was significantly greater compared with Google for the top 10 FAQs (1.9 +/- 0.2 vs 1.2 +/- 0.6; P = .001) and top 10 questions with numeric answers (1.8 +/- 0.4 vs 1 +/- 0.8; P = .013). CONCLUSION: ChatGPT-4 is capable of providing responses with clinically relevant content concerning UCL injuries and reconstruction. ChatGPT-4 utilized a greater proportion of academic websites to provide responses to FAQs representative of patient inquiries compared with Google Web Search and provided significantly more accurate answers. Moving forward, ChatGPT has the potential to be used as a clinical adjunct when answering queries about UCL injuries and reconstruction, but further validation is warranted before integrated or autonomous use in clinical settings.",Varady NH; Lu AZ; Mazzucco M; Dines JS; Altchek DW; Williams RJ 3rd; Kunze KN 37609022,ChatGPT-4: Transforming Medical Education and Addressing Clinical Exposure Challenges in the Post-pandemic Era.,2023,Indian journal of orthopaedics,,,,"BACKGROUND: The COVID-19 pandemic has affected medical education, constraining clinical exposure and posing unprecedented challenges for students and junior doctors. 
This research explores the potential of artificial intelligence (AI), specifically the ChatGPT-4 language model, to transform medical education and address the deficiencies in clinical exposure during the post-pandemic era. RESEARCH QUESTIONS/PURPOSE: What is the potential of AI large language models in delivering safe and coherent medical advice to junior doctors for clinical orthopaedic scenarios? PATIENTS AND METHODS: A series of diverse orthopaedic questions was presented to ChatGPT-4, from general medicine to highly specialised fields. The questions were based on a variety of common orthopaedic presentations including neck of femur fracture, compartment syndrome, pulmonary embolism, and a motor vehicle accident. A validated questionnaire (Likert Scale) was implemented to evaluate the answers produced by ChatGPT-4. RESULTS: Our results indicate that ChatGPT-4 exhibits exceptional proficiency in delivering accurate and coherent medical advice. Its intuitive interface, accessibility, and sophisticated algorithm render it an ideal supplementary tool for medical students and junior doctors. Despite certain limitations, such as its inability to fully address highly specialised areas, this study highlights the potential of AI and ChatGPT-4 to revolutionise medical education and fill the clinical exposure void generated by the pandemic. Future research should concentrate on the practical application of ChatGPT-4 in real-world medical environments and its integration with other emerging technologies to optimise its influence on the education and training of healthcare professionals. CONCLUSIONS: ChatGPT-4's integration into orthopaedic education and practice can mitigate pandemic-related experience gaps, promoting self-directed, personalised learning and decision-making support for interns and residents. Future advancements may address limitations to enhance healthcare professionals' learning and expertise. 
LEVEL OF EVIDENCE: Level III evidence-observational study.",Lower K; Seth I; Lim B; Seth N 39923067,AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination.,2025,BMC medical education,,,,"BACKGROUND: The creation of high-quality multiple-choice questions (MCQs) is essential for medical education assessments but is resource-intensive and time-consuming when done by human experts. Large language models (LLMs) like ChatGPT-4o offer a promising alternative, but their efficacy remains unclear, particularly in high-stakes exams. OBJECTIVE: This study aimed to evaluate the quality and psychometric properties of ChatGPT-4o-generated MCQs compared to human-created MCQs in a high-stakes medical licensing exam. METHODS: A prospective cohort study was conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM) organised by the Hong Kong College of Emergency Medicine in August 2024. Participants attempted two sets of 100 MCQs-one AI-generated and one human-generated. Expert reviewers assessed MCQs for factual correctness, relevance, difficulty, alignment with Bloom's taxonomy (remember, understand, apply and analyse), and item writing flaws. Psychometric analyses were performed, including difficulty and discrimination indices and KR-20 reliability. Candidate performance and time efficiency were also evaluated. RESULTS: Among 24 participants, AI-generated MCQs were easier (mean difficulty index = 0.78 +/- 0.22 vs. 0.69 +/- 0.23, p < 0.01) but showed similar discrimination indices to human MCQs (mean = 0.22 +/- 0.23 vs. 0.26 +/- 0.26). Agreement was moderate (ICC = 0.62, p = 0.01, 95% CI: 0.12-0.84). Expert reviews identified more factual inaccuracies (6% vs. 4%), irrelevance (6% vs. 0%), and inappropriate difficulty levels (14% vs. 1%) in AI MCQs. 
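The psychometric quantities named in the MCQ study above (item difficulty index and KR-20 reliability) have standard textbook definitions that can be sketched in a few lines of Python over a 0/1 response matrix (an illustrative sketch, not the study's analysis code):

```python
def item_difficulty(responses):
    """Difficulty index per item: the proportion of candidates answering
    that item correctly. `responses` is a list of rows, one per candidate,
    of 0/1 item scores."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(k)]

def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomous items:
    KR-20 = k/(k-1) * (1 - sum(p_i * q_i) / var(total)),
    using the population variance of candidates' total scores."""
    n = len(responses)
    k = len(responses[0])
    p = item_difficulty(responses)
    pq = sum(pi * (1 - pi) for pi in p)          # item score variances
    totals = [sum(row) for row in responses]      # total score per candidate
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - pq / var)
```

Under this definition a higher difficulty index means an easier item, which is why the AI-generated questions' mean of 0.78 versus 0.69 indicates they were easier.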
AI questions primarily tested lower-order cognitive skills, while human MCQs better assessed higher-order skills (chi(2) = 14.27, p = 0.003). AI significantly reduced time spent on question generation (24.5 vs. 96 person-hours). CONCLUSION: ChatGPT-4o demonstrates the potential for efficiently generating MCQs but lacks the depth needed for complex assessments. Human review remains essential to ensure quality. Combining AI efficiency with expert oversight could optimise question creation for high-stakes exams, offering a scalable model for medical education that balances time efficiency and content quality.",Law AK; So J; Lui CT; Choi YF; Cheung KH; Kei-Ching Hung K; Graham CA 38686550,"Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study.",2024,JMIR medical education,,,,"BACKGROUND: The deployment of OpenAI's ChatGPT-3.5 and its subsequent versions, ChatGPT-4 and ChatGPT-4 With Vision (4V; also known as ""GPT-4 Turbo With Vision""), has notably influenced the medical field. Having demonstrated remarkable performance in medical examinations globally, these models show potential for educational applications. However, their effectiveness in non-English contexts, particularly in Chile's medical licensing examinations-a critical step for medical practitioners in Chile-is less explored. This gap highlights the need to evaluate ChatGPT's adaptability to diverse linguistic and cultural contexts. OBJECTIVE: This study aims to evaluate the performance of ChatGPT versions 3.5, 4, and 4V in the EUNACOM (Examen Unico Nacional de Conocimientos de Medicina), a major medical examination in Chile. METHODS: Three official practice drills (540 questions) from the University of Chile, mirroring the EUNACOM's structure and difficulty, were used to test ChatGPT versions 3.5, 4, and 4V. The 3 ChatGPT versions were provided 3 attempts for each drill. 
Responses to questions during each attempt were systematically categorized and analyzed to assess their accuracy rate. RESULTS: All versions of ChatGPT passed the EUNACOM drills. Specifically, versions 4 and 4V outperformed version 3.5, achieving average accuracy rates of 79.32% and 78.83%, respectively, compared to 57.53% for version 3.5 (P<.001). Version 4V, however, did not outperform version 4 (P=.73), despite the additional visual capabilities. We also evaluated ChatGPT's performance in different medical areas of the EUNACOM and found that versions 4 and 4V consistently outperformed version 3.5. Across the different medical areas, version 3.5 displayed the highest accuracy in psychiatry (69.84%), while versions 4 and 4V achieved the highest accuracy in surgery (90.00% and 86.11%, respectively). Versions 3.5 and 4 had the lowest performance in internal medicine (52.74% and 75.62%, respectively), while version 4V had the lowest performance in public health (74.07%). CONCLUSIONS: This study reveals ChatGPT's ability to pass the EUNACOM, with distinct proficiencies across versions 3.5, 4, and 4V. Notably, advancements in artificial intelligence (AI) have not significantly led to enhancements in performance on image-based questions. The variations in proficiency across medical fields suggest the need for more nuanced AI training. Additionally, the study underscores the importance of exploring innovative approaches to using AI to augment human cognition and enhance the learning process. 
Such advancements have the potential to significantly influence medical education, fostering not only knowledge acquisition but also the development of critical thinking and problem-solving skills among health care professionals.",Rojas M; Rojas M; Burgess V; Toro-Perez J; Salehi S 37812468,ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation.,2023,JMIR medical informatics,,,,"BACKGROUND: The diagnostic accuracy of differential diagnoses generated by artificial intelligence chatbots, including ChatGPT models, for complex clinical vignettes derived from general internal medicine (GIM) department case reports is unknown. OBJECTIVE: This study aims to evaluate the accuracy of the differential diagnosis lists generated by both third-generation ChatGPT (ChatGPT-3.5) and fourth-generation ChatGPT (ChatGPT-4) by using case vignettes from case reports published by the Department of GIM of Dokkyo Medical University Hospital, Japan. METHODS: We searched PubMed for case reports. Upon identification, physicians selected diagnostic cases, determined the final diagnosis, and displayed them into clinical vignettes. Physicians typed the determined text with the clinical vignettes in the ChatGPT-3.5 and ChatGPT-4 prompts to generate the top 10 differential diagnoses. The ChatGPT models were not specially trained or further reinforced for this task. Three GIM physicians from other medical institutions created differential diagnosis lists by reading the same clinical vignettes. We measured the rate of correct diagnosis within the top 10 differential diagnosis lists, top 5 differential diagnosis lists, and the top diagnosis. RESULTS: In total, 52 case reports were analyzed. The rates of correct diagnosis by ChatGPT-4 within the top 10 differential diagnosis lists, top 5 differential diagnosis lists, and top diagnosis were 83% (43/52), 81% (42/52), and 60% (31/52), respectively. 
The rates of correct diagnosis by ChatGPT-3.5 within the top 10 differential diagnosis lists, top 5 differential diagnosis lists, and top diagnosis were 73% (38/52), 65% (34/52), and 42% (22/52), respectively. The rates of correct diagnosis by ChatGPT-4 were comparable to those by physicians within the top 10 (43/52, 83% vs 39/52, 75%, respectively; P=.47) and within the top 5 (42/52, 81% vs 35/52, 67%, respectively; P=.18) differential diagnosis lists and top diagnosis (31/52, 60% vs 26/52, 50%, respectively; P=.43) although the difference was not significant. The ChatGPT models' diagnostic accuracy did not significantly vary based on open access status or the publication date (before 2011 vs 2022). CONCLUSIONS: This study demonstrates the potential diagnostic accuracy of differential diagnosis lists generated using ChatGPT-3.5 and ChatGPT-4 for complex clinical vignettes from case reports published by the GIM department. The rate of correct diagnoses within the top 10 and top 5 differential diagnosis lists generated by ChatGPT-4 exceeds 80%. Although derived from a limited data set of case reports from a single department, our findings highlight the potential utility of ChatGPT-4 as a supplementary tool for physicians, particularly for those affiliated with the GIM department. Further investigations should explore the diagnostic accuracy of ChatGPT by using distinct case materials beyond its training data. Such efforts will provide a comprehensive insight into the role of artificial intelligence in enhancing clinical decision-making.",Hirosawa T; Kawamura R; Harada Y; Mizuta K; Tokumasu K; Kaji Y; Suzuki T; Shimizu T 39319150,Assessing the ChatGPT aptitude: A competent and effective Dermatology doctor?,2024,Heliyon,,,,"BACKGROUND: The efficacy and adeptness of ChatGPT 3.5 and ChatGPT 4.0 in the precise diagnosis and management of conditions like atopic dermatitis and Autoimmune blistering skin diseases (AIBD) remain to be elucidated. 
Therefore, this study examined the accuracy and effectiveness of the ChatGPT responses related to understanding, therapies, and specific cases of these two conditions. METHOD: First, the responses provided by ChatGPTs to a set of 50 questionnaires underwent evaluation by five distinct dermatologists, with adjudication by a third-party reviewer. The comparative analysis included the evaluative efficacy of both ChatGPT-3.5 and ChatGPT-4.0 against the diagnostic abilities exhibited by three distinct cohorts of qualified clinical professionals. An examination was then conducted to assess the diagnostic proficiency of ChatGPT-3.5 and ChatGPT-4.0 in the context of diagnosing specific instances of skin blistering autoimmune diseases. RESULTS: In assessing the proficiency of ChatGPTs in generating responses related to fundamental knowledge about AD, it is noteworthy that both versions, despite their lack of specialized training on medical databases, exhibited a commendable capacity to yield solutions showing a substantial degree of concurrence with evidence-based medical information. Accordingly, we observed that the performance of ChatGPT-4.0 was beyond that of ChatGPT-3.5. However, it is crucial to emphasize that ChatGPT-4.0 did not show the ability to offer answers surpassing those provided by associate senior and senior medical professionals. In the assessment designed to determine the proficiency of ChatGPTs in recognizing particular types of AIBD, it is evident that both ChatGPT-4 and ChatGPT-3.5 demonstrated inadequacy in providing responses that are both precise and accurate for each individual occurrence of this skin condition. CONCLUSION: Both ChatGPT-3.5 and ChatGPT-4.0 are satisfactory for addressing fundamental inquiries related to atopic dermatitis; however, they prove insufficient for diagnosing AIBD.
ChatGPT still has a considerable journey ahead before achieving utility within the professional medical domain.",Lian C; Yuan X; Chokkakula S; Wang G; Song B; Wang Z; Fan G; Yin C 39241674,Accuracy assessment of ChatGPT responses to frequently asked questions regarding anterior cruciate ligament surgery.,2024,The Knee,,,,"BACKGROUND: The emergence of artificial intelligence (AI) has allowed users to access large sources of information in a chat-like manner. Therefore, we sought to evaluate the accuracy of ChatGPT-4's responses to the 10 patient questions most frequently asked (FAQs) regarding anterior cruciate ligament (ACL) surgery. METHODS: A list of the top 10 FAQs pertaining to ACL surgery was created after conducting a search through all Sports Medicine Fellowship Institutions listed on the Arthroscopy Association of North America (AANA) and American Orthopaedic Society of Sports Medicine (AOSSM) websites. A Likert scale was used to grade response accuracy by two sports medicine fellowship-trained surgeons. Cohen's kappa was used to assess inter-rater agreement. Reproducibility of the responses over time was also assessed. RESULTS: Five of the 10 responses received a 'completely accurate' grade from both fellowship-trained surgeons, with three additional replies receiving 'completely accurate' status from at least one. Moreover, the inter-rater reliability accuracy assessment revealed moderate agreement between fellowship-trained attending physicians (weighted kappa = 0.57, 95% confidence interval 0.15-0.99). Additionally, 80% of the responses were reproducible over time. CONCLUSION: ChatGPT can be considered an accurate additional tool to answer general patient questions regarding ACL surgery. Nonetheless, patient-surgeon interaction should not be deferred and must continue to be the driving force for information retrieval.
Thus, the general recommendation is to address any questions in the presence of a qualified specialist.",Villarreal-Espinosa JB; Berreta RS; Allende F; Garcia JR; Ayala S; Familiari F; Chahla J 37424120,"Pediatrics in Artificial Intelligence Era: A Systematic Review on Challenges, Opportunities, and Explainability.",2023,Indian pediatrics,,,,"BACKGROUND: The emergence of artificial intelligence (AI) tools such as ChatGPT and Bard is disrupting a broad swathe of fields, including medicine. In pediatric medicine, AI is also increasingly being used across multiple subspecialties. However, the practical application of AI still faces a number of key challenges. Consequently, there is a requirement for a concise overview of the roles of AI across the multiple domains of pediatric medicine, which the current study seeks to address. AIM: To systematically assess the challenges, opportunities, and explainability of AI in pediatric medicine. METHODOLOGY: A systematic search was carried out on peer-reviewed databases, PubMed Central, Europe PubMed Central, and grey literature using search terms related to machine learning (ML) and AI for the years 2016 to 2022 in the English language. A total of 210 articles were retrieved that were screened with PRISMA for abstract, year, language, context, and proximal relevance to research aims. A thematic analysis was carried out to extract findings from the included studies. RESULTS: Twenty articles were selected for data abstraction and analysis, with three consistent themes emerging from these articles. In particular, eleven articles address the current state-of-the-art application of AI in diagnosing and predicting health conditions such as behavioral and mental health, cancer, syndromic and metabolic diseases. Five articles highlight the specific challenges of AI deployment in pediatric medicine: data security, handling, authentication, and validation.
Four articles set out future opportunities for AI to be adopted: the incorporation of Big Data, cloud computing, precision medicine, and clinical decision support systems. These studies collectively critically evaluate the potential of AI in overcoming current barriers to adoption. CONCLUSION: AI is proving disruptive within pediatric medicine and is presently associated with challenges, opportunities, and the need for explainability. AI should be viewed as a tool to enhance and support clinical decision-making rather than a substitute for human judgement and expertise. Future research should consequently focus on obtaining comprehensive data to ensure the generalizability of research findings.",Balla Y; Tirunagari S; Windridge D 38116306,"ChatGPT and Clinical Training: Perception, Concerns, and Practice of Pharm-D Students.",2023,Journal of multidisciplinary healthcare,,,,"BACKGROUND: The emergence of Chat-Generative Pre-trained Transformer (ChatGPT) by OpenAI has revolutionized AI technology, demonstrating significant potential in healthcare and pharmaceutical education, yet its real-world applicability in clinical training warrants further investigation. METHODS: A cross-sectional study was conducted between April and May 2023 to assess PharmD students' perceptions, concerns, and experiences regarding the integration of ChatGPT into clinical pharmacy education. The study utilized a convenience sampling method through online platforms and involved a questionnaire with sections on demographics, perceived benefits, concerns, and experience with ChatGPT. Statistical analysis was performed using SPSS, including descriptive and inferential analyses. RESULTS: The findings of the study involving 211 PharmD students revealed that the majority of participants were male (77.3%) and had prior experience with artificial intelligence (68.2%). Over two-thirds were aware of ChatGPT.
Most students (n=139, 65.9%) perceived potential benefits in using ChatGPT for various clinical tasks, with concerns including over-reliance, accuracy, and ethical considerations. Adoption of ChatGPT in clinical training varied, with some students not using it at all, while others utilized it for tasks like evaluating drug-drug interactions and developing care plans. Previous users tended to have higher perceived benefits and lower concerns, but the differences were not statistically significant. CONCLUSION: Utilizing ChatGPT in clinical training offers opportunities, but students' lack of trust in it for clinical decisions highlights the need for collaborative human-ChatGPT decision-making. It should complement healthcare professionals' expertise and be used strategically to compensate for human limitations. Further research is essential to optimize ChatGPT's effective integration.",Zawiah M; Al-Ashwal FY; Gharaibeh L; Abu Farha R; Alzoubi KH; Abu Hammour K; Qasim QA; Abrah F 39515421,Evaluating a generative artificial intelligence accuracy in providing medication instructions from smartphone images.,2025,Journal of the American Pharmacists Association : JAPhA,,,,"BACKGROUND: The Food and Drug Administration mandates patient labeling materials like the Medication Guide (MG) and Instructions for Use (IFU) to support appropriate medication use. However, challenges such as low health literacy and difficulties navigating these materials may lead to incorrect medication usage, resulting in therapy failure or adverse outcomes. The rise of generative AI presents an opportunity to provide scalable, personalized patient education through image recognition and text generation. OBJECTIVE: This study aimed to evaluate the accuracy and safety of medication instructions generated by ChatGPT based on user-provided drug images, compared to the manufacturer's standard instructions.
METHODS: Images of 12 medications requiring multiple steps for administration were uploaded to ChatGPT's image recognition function. ChatGPT's responses were compared to the official IFU and MG using text classifiers, Count Vectorization (CountVec), and Term Frequency-Inverse Document Frequency (TF-IDF). The clinical accuracy was further evaluated by independent pharmacists to determine if ChatGPT responses were valid for patient instruction. RESULTS: ChatGPT correctly identified all medications and generated patient instructions. CountVec outperformed TF-IDF in text similarity analysis, with an average similarity score of 76%. However, clinical evaluation revealed significant gaps in the instructions, particularly for complex administration routes, where ChatGPT's guidance lacked essential details, leading to lower clinical accuracy scores. CONCLUSION: While ChatGPT shows promise in generating patient-friendly medication instructions, its effectiveness varies based on the complexity of the medication. The findings underscore the need for further refinement and clinical oversight to ensure the safety and accuracy of AI-generated medical guidance, particularly for medications with complex administration processes.",Yassin Y; Nguyen T; Panchal K; Getchell K; Aungst T 39322837,Advancement of Generative Pre-trained Transformer Chatbots in Answering Clinical Questions in the Practical Rhinoplasty Guideline.,2025,Aesthetic plastic surgery,,,,"BACKGROUND: The Generative Pre-trained Transformer (GPT) series, which includes ChatGPT, is an artificial large language model that provides human-like text dialogue. This study aimed to evaluate the performance of artificial intelligence chatbots in answering clinical questions based on practical rhinoplasty guidelines. METHODS: Clinical questions (CQs) developed from the guidelines were used as question sources. 
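The smartphone-image record above compares ChatGPT's instructions with the official IFU/MG text using Count Vectorization and TF-IDF similarity. As a minimal pure-Python sketch of that kind of comparison (cosine similarity over raw token counts versus TF-IDF weights; the two instruction snippets are invented stand-ins, not actual IFU text):

```python
import math
from collections import Counter

def cosine(u, v):
    # cosine similarity between two sparse term-weight dicts
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def count_vec(text):
    # raw term counts ("CountVec" representation)
    return Counter(text.lower().split())

def tfidf_vecs(docs):
    # TF-IDF over the (tiny) two-document collection, with +1 smoothing
    counts = [count_vec(d) for d in docs]
    n = len(docs)
    vocab = set().union(*counts)
    idf = {t: math.log(n / sum(1 for c in counts if t in c)) + 1.0 for t in vocab}
    return [{t: c[t] * idf[t] for t in c} for c in counts]

ref = "shake the inhaler well then inhale the dose deeply"
resp = "shake the inhaler breathe out fully and inhale the dose deeply"
count_score = cosine(count_vec(ref), count_vec(resp))
a, b = tfidf_vecs([ref, resp])
tfidf_score = cosine(a, b)
```

TF-IDF downweights terms shared by every document, so scores from the two representations can diverge, which is consistent with the record's observation that the two measures performed differently.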
For each question, we asked GPT-4 and GPT-3.5 (ChatGPT), developed by OpenAI, to provide answers for the CQs, Policy Level, Aggregate Evidence Quality, Level of Confidence in Evidence, and References. We compared the performance of the two types of artificial intelligence (AI) chatbots. RESULTS: A total of 10 questions were included in the final analysis, and the AI chatbots correctly answered 90.0% of these. GPT-4 demonstrated a lower accuracy rate than GPT-3.5 in answering CQs, although without statistically significant difference (86.0% vs. 94.0%; p = 0.05), whereas GPT-4 showed significantly higher accuracy for the level of confidence in Evidence than GPT-3.5 (52.0% vs. 28.0%; p < 0.01). No statistical differences were observed in Policy Level, Aggregate Evidence Quality, and Reference Match. In addition, GPT-4 rated significantly higher in presenting existing references than GPT-3.5 (36.9% vs. 24.1%; p = 0.01). CONCLUSIONS: The overall performance of GPT-4 was similar to that of GPT-3.5. However, GPT-4 provided existing references at a higher rate than GPT-3.5. GPT-4 has the potential to provide a more accurate reference in professional fields, including rhinoplasty. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Shiraishi M; Tsuruda S; Tomioka Y; Chang J; Hori A; Ishii S; Fujinaka R; Ando T; Ohba J; Okazaki M 40116759,Performance of Plug-In Augmented ChatGPT and Its Ability to Quantify Uncertainty: Simulation Study on the German Medical Board Examination.,2025,JMIR medical education,,,,"BACKGROUND: The GPT-4 is a large language model (LLM) trained and fine-tuned on an extensive dataset. 
After the public release of its predecessor in November 2022, the use of LLMs has seen a significant spike in interest, and a multitude of potential use cases have been proposed. In parallel, however, important limitations have been outlined. In particular, current LLMs encounter limitations in symbolic representation and in accessing contemporary data. The recent version of GPT-4, alongside newly released plugin features, has been introduced to mitigate some of these limitations. OBJECTIVE: Against this background, this work aims to investigate the performance of GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins using pretranslated English text on the German medical board examination. Recognizing the critical importance of quantifying uncertainty for LLM applications in medicine, we furthermore assess this ability and develop a new metric termed ""confidence accuracy"" to evaluate it. METHODS: We used GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins and translation to answer questions from the German medical board examination. Additionally, we conducted an analysis to assess how the models justify their answers, the accuracy of their responses, and the error structure of their answers. Bootstrapping and CIs were used to evaluate the statistical significance of our findings. RESULTS: This study demonstrated that available GPT models, as LLM examples, exceeded the minimum competency threshold established by the German medical board for medical students to obtain board certification to practice medicine. Moreover, the models could assess the uncertainty in their responses, albeit exhibiting overconfidence. Additionally, this work unraveled certain justification and reasoning structures that emerge when GPT generates answers. CONCLUSIONS: The high performance of GPTs in answering medical questions positions them well for applications in academia and, potentially, clinical practice.
Its capability to quantify uncertainty in answers suggests it could be a valuable artificial intelligence agent within the clinical decision-making loop. Nevertheless, significant challenges must be addressed before artificial intelligence agents can be robustly and safely implemented in the medical domain.",Madrid J; Diehl P; Selig M; Rolauffs B; Hans FP; Busch HJ; Scheef T; Benning L 38037784,"Exploring artificial intelligence in the Nigerian medical educational space: An online cross-sectional study of perceptions, risks and benefits among students and lecturers from ten universities.",2023,The Nigerian postgraduate medical journal,,,,"BACKGROUND: The impact of artificial intelligence (AI) has been compared to that of the Internet and printing, evoking both apprehension and anticipation in an uncertain world. OBJECTIVE: This study aimed to explore the perceptions of medical students and faculty members from ten universities across Nigeria regarding AI. METHODS: Using Google Forms and WhatsApp, a cross-sectional online survey was administered to clinical year medical students and their lecturers from ten medical schools representing all the six geopolitical zones of Nigeria. RESULTS: The survey received 1003 responses, of which 708 (70.7%) were from students and 294 (29.3%) were from lecturers. Both groups displayed an average level of knowledge, with students (Median:4, range -5 to 12) significantly outperforming lecturers (Median:3, range -5 to 15). Social media (61.2%) was the most common form of first contact with AI. Participants demonstrated a favourable attitude towards AI, with a median score of 6.8 out of 10. Grammar checkers (62.3%) were the most commonly reported AI tool used, while ChatGPT (43.6%) was the most frequently mentioned dedicated AI tool. Students were significantly more likely than lecturers to have used AI tools in the past but <5% of both groups had received prior AI training. 
Excitement about the potential of AI slightly outweighed concerns regarding future risks. A significantly higher proportion of students compared to lecturers believed that AI could dehumanise health care (70.6% vs. 60.8%), render physicians redundant (57.6% vs. 34.7%), diminish physicians' skills (79.3% vs. 71.3%) and ultimately harm patients (28.6% vs. 20.6%). CONCLUSION: The simultaneous fascination and apprehension with AI observed among both lecturers and students in our study mirrors the global trend. This finding was particularly evident in students who, despite possessing greater knowledge of AI compared to their lecturers, did not exhibit a corresponding reduction in their fear of AI.",Oluwadiya KS; Adeoti AO; Agodirin SO; Nottidge TE; Usman MI; Gali MB; Onyemaechi NO; Ramat AM; Adedire A; Zakari LY 39883487,Using Large Language Models to Detect and Understand Drug Discontinuation Events in Web-Based Forums: Development and Validation Study.,2025,Journal of medical Internet research,,,,"BACKGROUND: The implementation of large language models (LLMs), such as BART (Bidirectional and Auto-Regressive Transformers) and GPT-4, has revolutionized the extraction of insights from unstructured text. These advancements have expanded into health care, allowing analysis of social media for public health insights. However, the detection of drug discontinuation events (DDEs) remains underexplored. Identifying DDEs is crucial for understanding medication adherence and patient outcomes. OBJECTIVE: The aim of this study is to provide a flexible framework for investigating various clinical research questions in data-sparse environments. We provide an example of the utility of this framework by identifying DDEs and their root causes in an open-source web-based forum, MedHelp, and by releasing the first open-source DDE datasets to aid further research in this domain. 
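The DDE framework described above assigns multiple root-cause labels to each drug discontinuation event; a standard way to score such multi-label output is the hamming loss, the fraction of label slots predicted incorrectly. A minimal sketch with an invented four-label set (the label names are hypothetical, not from the study):

```python
def hamming_loss(y_true, y_pred):
    # fraction of label slots predicted incorrectly, pooled over all samples
    assert len(y_true) == len(y_pred)
    total = sum(len(t) for t in y_true)
    wrong = sum(
        sum(1 for a, b in zip(t, p) if a != b)
        for t, p in zip(y_true, y_pred)
    )
    return wrong / total

# hypothetical root-cause labels per event: [side_effects, cost, inefficacy, other]
y_true = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0]]
y_pred = [[1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
loss = hamming_loss(y_true, y_pred)  # 2 wrong slots out of 12
```

Lower is better: a hamming loss of 0.129, as reported for GPT-4o below in this record, means roughly 13% of root-cause label slots were predicted incorrectly.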
METHODS: We used several LLMs, including GPT-4 Turbo, GPT-4o, DeBERTa (Decoding-Enhanced Bidirectional Encoder Representations from Transformer with Disentangled Attention), and BART, among others, to detect and determine the root causes of DDEs in user comments posted on MedHelp. Our study design included the use of zero-shot classification, which allows these models to make predictions without task-specific training. We split user comments into sentences and applied different classification strategies to assess the performance of these models in identifying DDEs and their root causes. RESULTS: Among the selected models, GPT-4o performed the best at determining the root causes of DDEs, predicting only 12.9% of root causes incorrectly (hamming loss). Among the open-source models tested, BART demonstrated the best performance in detecting DDEs, achieving an F1-score of 0.86, a false positive rate of 2.8%, and a false negative rate of 6.5%, all without any fine-tuning. The dataset included 10.7% (107/1000) DDEs, emphasizing the models' robustness in an imbalanced data context. CONCLUSIONS: This study demonstrated the effectiveness of open- and closed-source LLMs, such as GPT-4o and BART, for detecting DDEs and their root causes from publicly accessible data through zero-shot classification. The robust and scalable framework we propose can aid researchers in addressing data-sparse clinical research questions. The launch of open-access DDE datasets has the potential to stimulate further research and novel discoveries in this field.",Trevena W; Zhong X; Alvarado M; Semenov A; Oktay A; Devlin D; Gohil AY; Chittimouju SH 39913914,ChatGPT-4 Performance on German Continuing Medical Education-Friend or Foe (Trick or Treat)?
Protocol for a Randomized Controlled Trial.,2025,JMIR research protocols,,,,"BACKGROUND: The increasing development and spread of artificial and assistive intelligence are opening up new areas of application not only in applied medicine but also in related fields such as continuing medical education (CME), which is part of the mandatory training program for medical doctors in Germany. This study aimed to determine whether medical laypersons can successfully conduct training courses specifically for physicians with the help of a large language model (LLM) such as ChatGPT-4. This study aims to qualitatively and quantitatively investigate the impact of using artificial intelligence (AI; specifically ChatGPT) on the acquisition of credit points in German postgraduate medical education. OBJECTIVE: Using this approach, we wanted to test further possible applications of AI in the postgraduate medical education setting and obtain results for practical use. Depending on the results, the potential influence of LLMs such as ChatGPT-4 on CME will be discussed, for example, as part of a SWOT (strengths, weaknesses, opportunities, threats) analysis. METHODS: We designed a randomized controlled trial in which adult high school students attempt to solve CME tests across six medical specialties in three study arms, with 18 CME training courses per study arm, under different interventional conditions with varying amounts of permitted use of ChatGPT-4. Sample size calculation was performed including guess probability (20% correct answers, SD=40%; confidence level of 1-alpha=.95/alpha=.05; test power of 1-beta=.95; P<.05). The study was registered with the Open Science Framework. RESULTS: As of October 2024, data acquisition and recruitment of student participants for the trial are ongoing. Upon analysis of our acquired data, we expect our findings to be ready for publication as soon as early 2025.
CONCLUSIONS: We aim to prove that the advances in AI, especially LLMs such as ChatGPT-4, have considerable effects on medical laypersons' ability to successfully pass CME tests. The implications this holds for how the concept of continuing medical education may require reevaluation are yet to be contemplated. TRIAL REGISTRATION: OSF Registries 10.17605/OSF.IO/MZNUF; https://osf.io/mznuf. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): PRR1-10.2196/63887.",Burisch C; Bellary A; Breuckmann F; Ehlers J; Thal SC; Sellmann T; Godde D 39966152,Assessing the Informational Value of Large Language Models Responses in Aesthetic Surgery: A Comparative Analysis with Expert Opinions.,2025,Aesthetic plastic surgery,,,,"BACKGROUND: The increasing popularity of Large Language Models (LLMs) in various healthcare settings has raised questions about their ability to provide accurate and reliable information. This study aimed to evaluate the informational value of Large Language Model responses in aesthetic plastic surgery by comparing them with the opinions of experienced surgeons. METHODS: Thirty patients undergoing three common aesthetic procedures-dermal fillers, botulinum toxin injections, and aesthetic blepharoplasty-were selected. The most frequently asked questions by these patients were recorded and submitted to ChatGPT 3.5 and Google Bard v.1.53. The answers provided by the Large Language Models were then evaluated by 13 experienced aesthetic plastic surgeons on a Likert scale for accessibility, accuracy, and overall usefulness. RESULTS: The overall ratings of the chatbot responses were moderate, with surgeons generally finding them to be accurate and clear. However, the lack of transparency regarding the sources of the information provided by the LLMs made it impossible to fully evaluate their credibility.
CONCLUSIONS: While chatbots have the potential to provide patients with convenient access to information about aesthetic plastic surgery, their current limitations in terms of transparency and comprehensiveness warrant caution in their use as a primary source of information. Further research is needed to develop more robust and reliable LLMs for healthcare applications. LEVEL OF EVIDENCE I: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Grippaudo FR; Jeri M; Pezzella M; Orlando M; Ribuffo D 39752214,"ChatGPT's Attitude, Knowledge, and Clinical Application in Geriatrics Practice and Education: Exploratory Observational Study.",2025,JMIR formative research,,,,"BACKGROUND: The increasing use of ChatGPT in clinical practice and medical education necessitates the evaluation of its reliability, particularly in geriatrics. OBJECTIVE: This study aimed to evaluate ChatGPT's trustworthiness in geriatrics through 3 distinct approaches: evaluating ChatGPT's geriatrics attitude, knowledge, and clinical application with 2 vignettes of geriatric syndromes (polypharmacy and falls). METHODS: We used the validated University of California, Los Angeles, geriatrics attitude and knowledge instruments to evaluate ChatGPT's geriatrics attitude and knowledge and compare its performance with that of medical students, residents, and geriatrics fellows from reported results in the literature. We also evaluated ChatGPT's application to 2 vignettes of geriatric syndromes (polypharmacy and falls). RESULTS: The mean total score on geriatrics attitude of ChatGPT was significantly lower than that of trainees (medical students, internal medicine residents, and geriatric medicine fellows; 2.7 vs 3.7 on a scale from 1-5; 1=strongly disagree; 5=strongly agree). 
The mean subscore on positive geriatrics attitude of ChatGPT was higher than that of the trainees (medical students, internal medicine residents, and neurologists; 4.1 vs 3.7 on a scale from 1 to 5 where a higher score means a more positive attitude toward older adults). The mean subscore on negative geriatrics attitude of ChatGPT was lower than that of the trainees and neurologists (1.8 vs 2.8 on a scale from 1 to 5 where a lower subscore means a less negative attitude toward aging). On the University of California, Los Angeles geriatrics knowledge test, ChatGPT outperformed all medical students, internal medicine residents, and geriatric medicine fellows from validated studies (14.7 vs 11.3 with a score range of -18 to +18 where +18 means that all questions were answered correctly). Regarding the polypharmacy vignette, ChatGPT not only demonstrated solid knowledge of potentially inappropriate medications but also accurately identified 7 common potentially inappropriate medications and 5 drug-drug and 3 drug-disease interactions. However, ChatGPT missed 5 drug-disease and 1 drug-drug interaction and produced 2 hallucinations. Regarding the fall vignette, ChatGPT answered 3 of 5 pretests correctly and 2 of 5 pretests partially correctly, identified 6 categories of fall risks, followed fall guidelines correctly, listed 6 key physical examinations, and recommended 6 categories of fall prevention methods. CONCLUSIONS: This study suggests that ChatGPT can be a valuable supplemental tool in geriatrics, offering reliable information with less age bias, robust geriatrics knowledge, and comprehensive recommendations for managing 2 common geriatric syndromes (polypharmacy and falls) that are consistent with evidence from guidelines, systematic reviews, and other types of studies. ChatGPT's potential as an educational and clinical resource could significantly benefit trainees, health care providers, and laypeople. 
Further research using GPT-4o, larger geriatrics question sets, and more geriatric syndromes is needed to expand and confirm these findings before adopting ChatGPT widely for geriatrics education and practice.",Cheng HY 38882956,Does the Information Quality of ChatGPT Meet the Requirements of Orthopedics and Trauma Surgery?,2024,Cureus,,,,"BACKGROUND: The integration of artificial intelligence (AI) in medicine, particularly through AI-based language models like ChatGPT, offers a promising avenue for enhancing patient education and healthcare delivery. This study aims to evaluate the quality of medical information provided by Chat Generative Pre-trained Transformer (ChatGPT) regarding common orthopedic and trauma surgical procedures, assess its limitations, and explore its potential as a supplementary source for patient education. METHODS: Using the GPT-3.5-Turbo version of ChatGPT, simulated patient information was generated for 20 orthopedic and trauma surgical procedures. The study utilized standardized information forms as a reference for evaluating ChatGPT's responses. The accuracy and quality of the provided information were assessed using a modified DISCERN instrument, and a global medical assessment was conducted to categorize the information's usefulness and reliability. RESULTS: ChatGPT mentioned an average of 47% of relevant keywords across procedures, with a variance in the mention rate between 30.5% and 68.6%. The average modified DISCERN (mDISCERN) score was 2.4 out of 5, indicating a moderate to low quality of information. None of the ChatGPT-generated fact sheets were rated as ""very useful,"" with 45% deemed ""somewhat useful,"" 35% ""not useful,"" and 20% classified as ""dangerous."" A positive correlation was found between higher mDISCERN scores and better physician ratings, suggesting that information quality directly impacts perceived utility. 
CONCLUSION: While AI-based language models like ChatGPT hold significant promise for medical education and patient care, the current quality of information provided in the field of orthopedics and trauma surgery is suboptimal. Further development and refinement of AI sources and algorithms are necessary to improve the accuracy and reliability of medical information. This study underscores the need for ongoing research and development in AI applications in healthcare, emphasizing the critical role of accurate, high-quality information in patient education and informed consent processes.",Kasapovic A; Ali T; Babasiz M; Bojko J; Gathen M; Kaczmarczyk R; Roos J 38506920,Incorporating ChatGPT in Medical Informatics Education: Mixed Methods Study on Student Perceptions and Experiential Integration Proposals.,2024,JMIR medical education,,,,"BACKGROUND: The integration of artificial intelligence (AI) technologies, such as ChatGPT, in the educational landscape has the potential to enhance the learning experience of medical informatics students and prepare them for using AI in professional settings. The incorporation of AI in classes aims to develop critical thinking by encouraging students to interact with ChatGPT and critically analyze the responses generated by the chatbot. This approach also helps students develop important skills in the field of biomedical and health informatics to enhance their interaction with AI tools. OBJECTIVE: The aim of the study is to explore the perceptions of students regarding the use of ChatGPT as a learning tool in their educational context and provide professors with examples of prompts for incorporating ChatGPT into their teaching and learning activities, thereby enhancing the educational experience for students in medical informatics courses. METHODS: This study used a mixed methods approach to gain insights from students regarding the use of ChatGPT in education. 
To accomplish this, a structured questionnaire was applied to evaluate students' familiarity with ChatGPT, gauge their perceptions of its use, and understand their attitudes toward its use in academic and learning tasks. Learning outcomes of 2 courses were analyzed to propose ChatGPT's incorporation in master's programs in medicine and medical informatics. RESULTS: The majority of students expressed satisfaction with the use of ChatGPT in education, finding it beneficial for various purposes, including generating academic content, brainstorming ideas, and rewriting text. While some participants raised concerns about potential biases and the need for informed use, the overall perception was positive. Additionally, the study proposed integrating ChatGPT into 2 specific courses in the master's programs in medicine and medical informatics. The incorporation of ChatGPT was envisioned to enhance student learning experiences and assist in project planning, programming code generation, examination preparation, workflow exploration, and technical interview preparation, thus advancing medical informatics education. In medical teaching, it will be used as an assistant for simplifying the explanation of concepts and solving complex problems, as well as for generating clinical narratives and patient simulators. CONCLUSIONS: The study's valuable insights into medical faculty students' perspectives and integration proposals for ChatGPT serve as an informative guide for professors aiming to enhance medical informatics education. The research delves into the potential of ChatGPT, emphasizes the necessity of collaboration in academic environments, identifies subject areas with discernible benefits, and underscores its transformative role in fostering innovative and engaging learning experiences. 
The envisaged proposals hold promise in empowering future health care professionals to work in the rapidly evolving era of digital health care.",Magalhaes Araujo S; Cruz-Correia R 40106227,Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study.,2025,JMIR medical education,,,,"BACKGROUND: The integration of artificial intelligence (AI), notably ChatGPT, into medical education, has shown promising results in various medical fields. Nevertheless, its efficacy in traditional Chinese medicine (TCM) examinations remains understudied. OBJECTIVE: This study aims to (1) assess the performance of ChatGPT on the TCM licensing examination in Taiwan and (2) evaluate the model's explainability in answering TCM-related questions to determine its suitability as a TCM learning tool. METHODS: We used the GPT-4 model to respond to 480 questions from the 2022 TCM licensing examination. This study compared the performance of the model against that of licensed TCM doctors using 2 approaches, namely direct answer selection and provision of explanations before answer selection. The accuracy and consistency of AI-generated responses were analyzed. Moreover, a breakdown of question characteristics was performed based on the cognitive level, depth of knowledge, types of questions, vignette style, and polarity of questions. RESULTS: ChatGPT achieved an overall accuracy of 43.9%, which was lower than that of 2 human participants (70% and 78.4%). The analysis did not reveal a significant correlation between the accuracy of the model and the characteristics of the questions. An in-depth examination indicated that errors predominantly resulted from a misunderstanding of TCM concepts (55.3%), emphasizing the limitations of the model with regard to its TCM knowledge base and reasoning capability. CONCLUSIONS: Although ChatGPT shows promise as an educational tool, its current performance on TCM licensing examinations is lacking. 
This highlights the need for enhancing AI models with specialized TCM training and suggests a cautious approach to utilizing AI for TCM education. Future research should focus on model improvement and the development of tailored educational applications to support TCM learning.",Tseng LW; Lu YC; Tseng LC; Chen YC; Chen HY 39760952,Artificial Intelligence in Physical Therapy: Evaluating ChatGPT's Role in Clinical Decision Support for Musculoskeletal Care.,2025,Annals of biomedical engineering,,,,"BACKGROUND: The integration of artificial intelligence into medicine has attracted increasing attention in recent years. ChatGPT has emerged as a promising tool for delivering evidence-based recommendations in various clinical domains. However, the application of ChatGPT to physical therapy for musculoskeletal conditions has yet to be investigated. METHODS: Thirty clinical questions related to spinal, lower extremity, and upper extremity conditions were posed to ChatGPT-4. Responses were assessed for accuracy against clinical practice guidelines by two reviewers. Intra- and inter-rater reliability were measured using Fleiss' kappa (k). RESULTS: ChatGPT's responses were consistent with CPG recommendations for 80% of the questions. Performance was highest for upper extremity conditions (100%) and lowest for spinal conditions (60%), with a moderate performance for lower extremity conditions (87%). Intra-rater reliability was good (k = 0.698 and k = 0.631 for the two reviewers), and inter-rater reliability was very good (k = 0.847). CONCLUSION: ChatGPT demonstrates promise as a supplementary decision-making support tool for physical therapy, with good accuracy and reliability in aligning with clinical practice guideline recommendations. 
Further research is needed to evaluate its performance across broader scenarios and refine its clinical applicability.",Hao J; Yao Z; Tang Y; Remis A; Wu K; Yu X 39013110,ChatGPT vs Medical Professional: Analyzing Responses to Laboratory Medicine Questions on Social Media.,2024,Clinical chemistry,,,,"BACKGROUND: The integration of ChatGPT, a large language model (LLM) developed by OpenAI, into healthcare has sparked significant interest due to its potential to enhance patient care and medical education. With the increasing trend of patients accessing laboratory results online, there is a pressing need to evaluate the effectiveness of ChatGPT in providing accurate laboratory medicine information. Our study evaluates ChatGPT's effectiveness in addressing patient questions in this area, comparing its performance with that of medical professionals on social media. METHODS: This study sourced patient questions and medical professional responses from Reddit and Quora, comparing them with responses generated by ChatGPT versions 3.5 and 4.0. Experienced laboratory medicine professionals evaluated the responses for quality and preference. Evaluation results were further analyzed using R software. RESULTS: The study analyzed 49 questions, with evaluators reviewing responses from both medical professionals and ChatGPT. ChatGPT's responses were preferred by 75.9% of evaluators and generally received higher ratings for quality. They were noted for their comprehensive and accurate information, whereas responses from medical professionals were valued for their conciseness. The interrater agreement was fair, indicating some subjectivity but a consistent preference for ChatGPT's detailed responses. CONCLUSIONS: ChatGPT demonstrates potential as an effective tool for addressing queries in laboratory medicine, often surpassing medical professionals in response quality. 
These results support the need for further research to confirm ChatGPT's utility and explore its integration into healthcare settings.",Girton MR; Greene DN; Messerlian G; Keren DF; Yu M 39844924,Large Language Models in Healthcare: A Bibliometric Analysis and Examination of Research Trends.,2025,Journal of multidisciplinary healthcare,,,,"BACKGROUND: The integration of large language models (LLMs) in healthcare has generated significant interest due to their potential to improve diagnostic accuracy, personalization of treatment, and patient care efficiency. OBJECTIVE: This study aims to conduct a comprehensive bibliometric analysis to identify current research trends, main themes and future directions regarding applications in the healthcare sector. METHODS: A systematic scan of publications until 08.05.2024 was carried out from an important database such as Web of Science. Using bibliometric tools such as VOSviewer and CiteSpace, we analyzed data covering publication counts, citation analysis, co-authorship, co-occurrence of keywords and thematic development to map the intellectual landscape and collaborative networks in this field. RESULTS: The analysis included more than 500 articles published between 2021 and 2024. The United States, Germany and the United Kingdom were the top contributors to this field. The study highlights that neural network applications in diagnostic imaging, natural language processing for clinical documentation, and patient data in the field of general internal medicine, radiology, medical informatics, health care services, surgery, oncology, ophthalmology, neurology, orthopedics and psychiatry have seen significant growth in publications over the past two years. 
Keyword trend analysis revealed emerging sub-themes such as clinical research, artificial intelligence, ChatGPT, education, natural language processing, clinical management, virtual reality, and chatbots, indicating a shift towards addressing the broader implications of LLM application in healthcare. CONCLUSION: The use of LLMs in healthcare is an expanding field with significant academic and clinical interest. This bibliometric analysis not only maps the current state of the research, but also identifies important areas that require further research and development. Continued advances in this field are expected to significantly impact future healthcare applications, with a focus on increasing the accuracy and personalization of patient care through advanced data analytics.",Gencer G; Gencer K 38414074,Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology - a recent scoping review.,2024,Diagnostic pathology,,,,"BACKGROUND: The integration of large language models (LLMs) like ChatGPT in diagnostic medicine, with a focus on digital pathology, has garnered significant attention. However, understanding the challenges and barriers associated with the use of LLMs in this context is crucial for their successful implementation. METHODS: A scoping review was conducted to explore the challenges and barriers of using LLMs, in diagnostic medicine with a focus on digital pathology. A comprehensive search was conducted using electronic databases, including PubMed and Google Scholar, for relevant articles published within the past four years. The selected articles were critically analyzed to identify and summarize the challenges and barriers reported in the literature. RESULTS: The scoping review identified several challenges and barriers associated with the use of LLMs in diagnostic medicine. 
These included limitations in contextual understanding and interpretability, biases in training data, ethical considerations, impact on healthcare professionals, and regulatory concerns. Contextual understanding and interpretability challenges arise from the models' lack of true understanding of medical concepts, the fact that these models are not explicitly trained on medical records selected by trained professionals, and the black-box nature of LLMs. Biases in training data pose a risk of perpetuating disparities and inaccuracies in diagnoses. Ethical considerations include patient privacy, data security, and responsible AI use. The integration of LLMs may impact healthcare professionals' autonomy and decision-making abilities. Regulatory concerns surround the need for guidelines and frameworks to ensure safe and ethical implementation. CONCLUSION: The scoping review highlights the challenges and barriers of using LLMs in diagnostic medicine with a focus on digital pathology. Understanding these challenges is essential for addressing the limitations and developing strategies to overcome barriers. It is critical for health professionals to be involved in the selection of data and fine tuning of the models. Further research, validation, and collaboration between AI developers, healthcare professionals, and regulatory bodies are necessary to ensure the responsible and effective integration of LLMs in diagnostic medicine.",Ullah E; Parwani A; Baig MM; Singh R 39851275,The Potential Clinical Utility of the Customized Large Language Model in Gastroenterology: A Pilot Study.,2024,"Bioengineering (Basel, Switzerland)",,,,"Background: The large language model (LLM) has the potential to be applied to clinical practice. However, there has been scarce study on this in the field of gastroenterology. 
Aim: This study explores the potential clinical utility of two LLMs in the field of gastroenterology: a customized GPT model and a conventional GPT-4o, an advanced LLM capable of retrieval-augmented generation (RAG). Method: We established a customized GPT with the BM25 algorithm using OpenAI's GPT-4o model, which allows it to produce responses in the context of specific documents including textbooks of internal medicine (in English) and gastroenterology (in Korean). We also prepared access to conventional ChatGPT-4o (accessed on 16 October 2024). The benchmark (written in Korean) consisted of 15 clinical questions developed by four clinical experts, representing typical questions for medical students. The two LLMs, a gastroenterology fellow, and an expert gastroenterologist were tested to assess their performance. Results: While the customized LLM correctly answered 8 out of 15 questions, the fellow answered 10 correctly. When the standardized Korean medical terms were replaced with English terminology, the LLM's performance improved, answering two additional knowledge-based questions correctly, matching the fellow's score. However, judgment-based questions remained a challenge for the model. Even with the implementation of 'Chain of Thought' prompt engineering, the customized GPT did not achieve improved reasoning. Conventional GPT-4o achieved the highest score among the AI models (14/15). Although both models performed slightly below the expert gastroenterologist's level (15/15), they show promising potential for clinical applications (scores comparable with or higher than that of the gastroenterology fellow). Conclusions: LLMs could be utilized to assist with specialized tasks such as patient counseling. 
However, RAG capabilities, by enabling real-time retrieval of external data not included in the training dataset, appear essential for managing complex, specialized content, and clinician oversight will remain crucial to ensure safe and effective use in clinical practice.",Gong EJ; Bang CS; Lee JJ; Park J; Kim E; Kim S; Kimm M; Choi SH 39593074,Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review.,2024,BMC medical informatics and decision making,,,,"BACKGROUND: The large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. METHODS: We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans. RESULTS: We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were ""accuracy"", ""completeness"", ""appropriateness"", ""insight"", and ""consistency"". CONCLUSIONS: The most frequently used criteria for defining high-quality LLMs have been consistently selected by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess the quality of LLM outputs can be developed to facilitate research studies on LLMs in healthcare.",Ho CN; Tian T; Ayers AT; Aaron RE; Phillips V; Wolf RM; Mathioudakis N; Dai T; Klonoff DC 40348690,Can Gpt-4o Accurately Diagnose Trauma X-Rays? 
A Comparative Study with Expert Evaluations.,2025,The Journal of emergency medicine,,,,"BACKGROUND: The latest artificial intelligence (AI) model, GPT-4o, introduced by OpenAI, can process visual data, presenting a novel opportunity for radiographic evaluation in trauma patients. OBJECTIVE: This study aimed to assess the efficacy of GPT-4o in interpreting radiographs for traumatic bone pathologies and to compare its performance with that of emergency medicine and orthopedic specialists. METHODS: The study involved 10 emergency medicine specialists, 10 orthopedic specialists, and the GPT-4o AI model, evaluating 25 cases of traumatic bone pathologies of the upper and lower extremities selected from the Radiopaedia website. Participants were asked to identify fractures or dislocations in the radiographs within 45 minutes. GPT-4o was instructed to perform the same task in 10 different chat sessions. RESULTS: Emergency medicine specialists and orthopedic specialists demonstrated an average accuracy of 82.8% and 87.2%, respectively, in radiograph interpretation. In contrast, GPT-4o achieved an accuracy of only 11.2%. Statistical analysis revealed significant differences among the three groups (p < 0.001), with GPT-4o performing significantly worse than both groups of specialists. CONCLUSION: GPT-4o's ability to interpret radiographs of traumatic bone pathologies is currently limited and significantly inferior to that of trained specialists. 
These findings underscore the ongoing need for human expertise in trauma diagnosis and highlight the challenges of applying AI to complex medical imaging tasks.",Ozturk A; Gunay S; Ates S; Yigit Yavuz Yigit Y 39509695,Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review.,2024,Journal of medical Internet research,,,,"BACKGROUND: The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios. OBJECTIVE: This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field. METHODS: We used PubMed, ACM, and the IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns. RESULTS: Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% of papers). 
These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) papers using LLMs for either summarization or medical knowledge inquiry, or both, and there were 58 (89%) papers expressing concerns about either reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and providing general medical knowledge to patients with a relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. While bias or privacy issues are often noted as concerns, no experiments in our reviewed papers thoughtfully examined how conversational LLMs lead to these issues in health care research. CONCLUSIONS: Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms of how LLM applications bring bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in health care.",Wang L; Wan Z; Ni C; Song Q; Li Y; Clayton E; Malin B; Yin Z 39402411,Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education.,2025,Journal of general internal medicine,,,,"BACKGROUND: The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. 
Pre-clinical students are asked to complete multiple spaced critical appraisal assignments. However, the individual feedback requires significant faculty time. As large language models (LLMs) can score and generate feedback, we explored their use in grading formative assessments through validity and feasibility lenses. OBJECTIVE: To explore the consistency and feasibility of using an LLM to assess and provide feedback for formative assessments in undergraduate medical education. DESIGN AND PARTICIPANTS: This was a cross-sectional study of pre-clinical students' critical appraisal assignments at University of Illinois College of Medicine (UICOM) during the 2022-2023 academic year. INTERVENTION: An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared to the existing faculty grade. MAIN MEASURES: Differences in scoring of individual items between ChatGPT and faculty were assessed. Scoring consistency using inter-rater reliability (IRR) was calculated as percent exact agreement. Chi-squared test was used to determine if there were significant differences in scores. Psychometric characteristics including internal-consistency reliability, area under precision-recall curve (AUCPR), and cost were studied. KEY RESULTS: In this cross-sectional study, 111 pre-clinical students' faculty-graded assignments were compared with those of ChatGPT, and the scoring of individual items was comparable. The overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P < 0.001); mean AUCPR was 0.69 (range 0.61-0.76). Internal-consistency reliability of ChatGPT was 0.64 and its use resulted in a fivefold reduction in faculty time, and potential savings of 150 faculty hours. 
CONCLUSIONS: This study of psychometric characteristics of ChatGPT demonstrates the potential role for LLMs to assist faculty in assessing and providing feedback for formative assignments.",Sreedhar R; Chang L; Gangopadhyaya A; Shiels PW; Loza J; Chi E; Gabel E; Park YS 39507012,Performance of ChatGPT on prehospital acute ischemic stroke and large vessel occlusion (LVO) stroke screening.,2024,Digital health,,,,"BACKGROUND: The management of acute ischemic stroke (AIS) is time-sensitive, yet prehospital delays remain prevalent. The application of large language models (LLMs) for medical text analysis may play a potential role in clinical decision support. We assess the performance of LLMs on prehospital AIS and large vessel occlusion (LVO) stroke screening. METHODS: This retrospective study sourced cases from the electronic medical record database of the emergency department (ED) at Maoming People's Hospital, encompassing patients who presented to the ED between June and November 2023. We evaluate the diagnostic accuracy of GPT-3.5 and GPT-4 for the detection of AIS and LVO stroke by comparing the sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and positive likelihood ratio and AUC of both LLMs. The neurological reasoning of LLMs was rated on a five-point Likert scale for factual correctness and the occurrence of errors. RESULT: On 400 records from 400 patients (mean age, 70.0 years +/- 12.5 [SD]; 273 male), GPT-4 outperformed GPT-3.5 in AIS screening (AUC 0.75 (0.65-0.84) vs 0.59 (0.50-0.69), P = 0.015) and LVO identification (AUC 0.71 (0.65-0.77) vs 0.60 (0.53-0.66), P < 0.001). GPT-4 achieved higher accuracy than GPT-3.5 in screening of AIS (89.3% [95% CI: 85.8, 91.9] vs 86.5% [95% CI: 82.8, 89.5]) and LVO stroke identification (67.0% [95% CI: 62.3%, 71.4%] vs 47.3% [95% CI: 42.4%, 52.2%]). 
In neurological reasoning, GPT-4 had higher Likert scale scores for factual correctness (4.24 vs 3.62), with a lower rate of error (6.8% vs 24.8%) than GPT-3.5 (all P < 0.001). CONCLUSIONS: The results demonstrate that LLMs possess diagnostic capability in the prehospital identification of ischemic stroke, with the ability to exhibit neurologically informed reasoning processes. Notably, GPT-4 outperforms GPT-3.5 in the recognition of AIS and LVO stroke, achieving results comparable to prehospital scales. LLMs are expected to become a promising supportive decision-making tool for EMS practitioners in screening prehospital stroke.",Wang X; Ye S; Feng J; Feng K; Yang H; Li H 39731895,"Evaluating LLM-based generative AI tools in emergency triage: A comparative study of ChatGPT Plus, Copilot Pro, and triage nurses.",2025,The American journal of emergency medicine,,,,"BACKGROUND: The number of emergency department (ED) visits has been on steady increase globally. Artificial Intelligence (AI) technologies, including Large Language Model (LLMs)-based generative AI models, have shown promise in improving triage accuracy. This study evaluates the performance of ChatGPT and Copilot in triage at a high-volume urban hospital, hypothesizing that these tools can match trained physicians' accuracy and reduce human bias amidst ED crowding challenges. METHODS: This single-center, prospective observational study was conducted in an urban ED over one week. Adult patients were enrolled through random 24-h intervals. Exclusions included minors, trauma cases, and incomplete data. Triage nurses assessed patients while an emergency medicine (EM) physician documented clinical vignettes and assigned emergency severity index (ESI) levels. These vignettes were then introduced to ChatGPT and Copilot for comparison with the triage nurse's decision. RESULTS: The overall triage accuracy was 65.2 % for nurses, 66.5 % for ChatGPT, and 61.8 % for Copilot, with no significant difference (p = 0.000). 
Moderate agreement was observed between the EM physician and ChatGPT, triage nurses, and Copilot (Cohen's Kappa = 0.537, 0.477, and 0.472, respectively). In recognizing high-acuity patients, ChatGPT and Copilot outperformed triage nurses (87.8 % and 85.7 % versus 32.7 %, respectively). Compared to ChatGPT and Copilot, nurses significantly under-triaged patients (p < 0.05). The analysis of predictive performance for ChatGPT, Copilot, and triage nurses demonstrated varying discrimination abilities across ESI levels, all of which were statistically significant (p < 0.05). ChatGPT and Copilot exhibited consistent accuracy across age, gender, and admission time, whereas triage nurses were more likely to mistriage patients under 45 years old. CONCLUSION: ChatGPT and Copilot outperform traditional nurse triage in identifying high-acuity patients, but real-time ED capacity data is crucial to prevent overcrowding and ensure high-quality of emergency care.",Arslan B; Nuhoglu C; Satici MO; Altinbilek E 38915174,Evaluation of ChatGPT-Generated Differential Diagnosis for Common Diseases With Atypical Presentation: Descriptive Research.,2024,JMIR medical education,,,,"BACKGROUND: The persistence of diagnostic errors, despite advances in medical knowledge and diagnostics, highlights the importance of understanding atypical disease presentations and their contribution to mortality and morbidity. Artificial intelligence (AI), particularly generative pre-trained transformers like GPT-4, holds promise for improving diagnostic accuracy, but requires further exploration in handling atypical presentations. OBJECTIVE: This study aimed to assess the diagnostic accuracy of ChatGPT in generating differential diagnoses for atypical presentations of common diseases, with a focus on the model's reliance on patient history during the diagnostic process. METHODS: We used 25 clinical vignettes from the Journal of Generalist Medicine characterizing atypical manifestations of common diseases. 
Two general medicine physicians categorized the cases based on atypicality. ChatGPT was then used to generate differential diagnoses based on the clinical information provided. The concordance between AI-generated and final diagnoses was measured, with a focus on the top-ranked disease (top 1) and the top 5 differential diagnoses (top 5). RESULTS: ChatGPT's diagnostic accuracy decreased with an increase in atypical presentation. For category 1 (C1) cases, the concordance rates were 17% (n=1) for the top 1 and 67% (n=4) for the top 5. Categories 3 (C3) and 4 (C4) showed a 0% concordance for top 1 and markedly lower rates for the top 5, indicating difficulties in handling highly atypical cases. The χ² test revealed no significant difference in the top 1 differential diagnosis accuracy between less atypical (C1+C2) and more atypical (C3+C4) groups (χ²(1)=2.07; n=25; P=.13). However, a significant difference was found in the top 5 analyses, with less atypical cases showing higher accuracy (χ²(1)=4.01; n=25; P=.048). CONCLUSIONS: ChatGPT-4 demonstrates potential as an auxiliary tool for diagnosing typical and mildly atypical presentations of common diseases. However, its performance declines with greater atypicality. 
The study findings underscore the need for AI systems to encompass a broader range of linguistic capabilities, cultural understanding, and diverse clinical scenarios to improve diagnostic utility in real-world settings.",Shikino K; Shimizu T; Otsuka Y; Tago M; Takahashi H; Watari T; Sasaki Y; Iizuka G; Tamura H; Nakashima K; Kunitomo K; Suzuki M; Aoyama S; Kosaka S; Kawahigashi T; Matsumoto T; Orihara F; Morikawa T; Nishizawa T; Hoshina Y; Yamamoto Y; Matsuo Y; Unoki Y; Kimura H; Tokushima M; Watanuki S; Saito T; Otsuka F; Tokuda Y 38329802,Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study.,2024,JMIR medical education,,,,"BACKGROUND: The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance. OBJECTIVE: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination. METHODS: To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022. RESULTS: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. 
This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research. CONCLUSIONS: The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While its predecessor (GPT-3.5) was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.",Meyer A; Riese J; Streichert T 40158187,Generating learning guides for medical education with LLMs and statistical analysis of test results.,2025,BMC medical education,,,,"BACKGROUND: The Progress Test Medizin (PTM) is a formative test for medical students issued twice a year by the Charité-Universitätsmedizin Berlin. The PTM provides a numerical feedback based on a global view of the strengths and weaknesses of students. This feedback can benefit from more fine-grained information, pinpointing the topics where students need to improve, as well as advice on what they should learn in light of their results. The scale of the PTM, taken by more than 10,000 participants every academic semester, makes it necessary to automate this task. METHODS: We have developed a seven-step approach based on large language models and statistical analysis to fulfil the purpose of this study. Firstly, a large language model (ChatGPT 4.0) identified keywords in the form of MeSH terms from all 200 questions of one PTM run. 
These keywords were checked against the list of medical terms included in the Medical Subject Headings (MeSH) thesaurus published by the National Library of Medicine (NLM). Meanwhile, answer patterns of PTM questions were also analysed to find empirical relationships between questions. With this information, we obtained a series of questions related to specific MeSH terms and used them to develop a framework that allowed us to assess the performance of PTM participants and compose personalized feedback structured around a curated list of medical topics. RESULTS: We used data from a past PTM to simulate the generation of personalized feedback for 1,401 test participants, thereby producing specific information about their knowledge regarding a number of topics ranging from 34 to 243. Substantial knowledge gaps were found in 14.67% to 21.76% of rated learning topics, depending on the benchmarking set considered. CONCLUSION: We designed and tested a method to generate student feedback covering up to 243 medical topics defined by MeSH terms. The feedback generated with data from students in later stages of their studies was more detailed, as they tend to face more questions matching their knowledge level.",Rosello Atanet I; Tomova M; Sieg M; Sehy V; Mader P; Marz M 40008118,Evaluating the Use of Artificial Intelligence as a Study Tool for Preclinical Medical School Exams.,2025,Journal of medical education and curricular development,,,,"BACKGROUND: The purpose of this 2024 study was to determine if there is an association between the usage of artificial intelligence (AI) to study and exam scores of medical students in the preclinical phase of their schooling. METHODS: We created and distributed a survey via an unbiased third party to students in the class of 2027 at the Kirk Kerkorian School of Medicine at UNLV to evaluate students' AI use to study for their preclinical system-based exams. 
Students were categorized into two groups: those who use AI to study and those who do not. Two-sample t-tests were run to compare the mean exam scores of both groups on six different organ system exams as well as the cumulative final exam score for each group. The group that did use AI was further asked about which AI tools they use and how exactly they use these tools to study for preclinical examinations. RESULTS: The results of the study showed that there is no statistically significant difference in exam scores between students who use AI for study purposes and students who do not. It was also found that most AI users studied with ChatGPT. The most common way users studied was by using AI to simplify and clarify topics they did not understand. CONCLUSIONS: Based on the results of this study, we concluded that students' use of AI programs to study for medical examinations had neither a positive nor a negative effect on their organ system-based exam scores.",Sakelaris PG; Novotny KV; Borvick MS; Lagasca GG; Simanton EG 39221376,Comparison of the Performance of Artificial Intelligence Versus Medical Professionals in the Polish Final Medical Examination.,2024,Cureus,,,,"BACKGROUND: The rapid development of artificial intelligence (AI) technologies like OpenAI's Generative Pretrained Transformer (GPT), particularly ChatGPT, has shown promising applications in various fields, including medicine. This study evaluates ChatGPT's performance on the Polish Final Medical Examination (LEK), comparing its efficacy to that of human test-takers. METHODS: The study analyzed ChatGPT's ability to answer 196 multiple-choice questions from the spring 2021 LEK. Questions were categorized into ""clinical cases"" and ""other"" general medical knowledge, and then divided according to medical fields. Two versions of ChatGPT (3.5 and 4.0) were tested. 
Statistical analyses, including Pearson's chi-squared test and the Mann-Whitney U test, were conducted to compare the AI's performance and confidence levels. RESULTS: ChatGPT 3.5 correctly answered 50.51% of the questions, while ChatGPT 4.0 answered 77.55% correctly, surpassing the 56% passing threshold. Version 3.5 showed significantly higher confidence in correct answers, whereas version 4.0 maintained consistent confidence regardless of answer accuracy. No significant differences in performance were observed across different medical fields. CONCLUSIONS: ChatGPT 4.0 demonstrated the ability to pass the LEK, indicating substantial potential for AI in medical education and assessment. Future improvements in AI models, such as the anticipated ChatGPT 5.0, may further enhance performance, potentially equaling or surpassing human test-takers.",Jaworski A; Jasinski D; Jaworski W; Hop A; Janek A; Slawinska B; Konieczniak L; Rzepka M; Jung M; Syslo O; Jarzabek V; Blecha Z; Harazinski K; Jasinska N 39042876,Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination.,2024,JMIR medical education,,,,"BACKGROUND: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. OBJECTIVE: This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students. METHODS: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). 
In both V1 and V2, ChatGPT received the same set of questions administered to the students. RESULTS: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 passed the examination, outperforming 139 (62.1%) medical students. CONCLUSIONS: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. 
The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources.",Cherif H; Moussa C; Missaoui AM; Salouage I; Mokaddem S; Dhahri B 39348189,Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage of Simulated Patients Using the Simple Triage and Rapid Treatment (START) Protocol: Gage Repeatability and Reproducibility Study.,2024,Journal of medical Internet research,,,,"BACKGROUND: The release of ChatGPT (OpenAI) in November 2022 drastically reduced the barrier to using artificial intelligence by allowing a simple web-based text interface to a large language model (LLM). One use case where ChatGPT could be useful is in triaging patients at the site of a disaster using the Simple Triage and Rapid Treatment (START) protocol. However, LLMs experience several common errors including hallucinations (also called confabulations) and prompt dependency. OBJECTIVE: This study addresses the research problem: ""Can ChatGPT adequately triage simulated disaster patients using the START protocol?"" by measuring three outcomes: repeatability, reproducibility, and accuracy. METHODS: Nine prompts were developed by 5 disaster medicine physicians. A Python script queried ChatGPT Version 4 for each prompt combined with 391 validated simulated patient vignettes. Ten repetitions of each combination were performed for a total of 35,190 simulated triages. A reference standard START triage code for each simulated case was assigned by 2 disaster medicine specialists (JMF and MV), with a third specialist (LC) added if the first two did not agree. Results were evaluated using a gage repeatability and reproducibility study (gage R and R). Repeatability was defined as variation due to repeated use of the same prompt. 
Reproducibility was defined as variation due to the use of different prompts on the same patient vignette. Accuracy was defined as agreement with the reference standard. RESULTS: Although 35,102 (99.7%) queries returned a valid START score, there was considerable variability. Repeatability (use of the same prompt repeatedly) was 14% of the overall variation. Reproducibility (use of different prompts) was 4.1% of the overall variation. The accuracy of ChatGPT for START was 63.9% with a 32.9% overtriage rate and a 3.1% undertriage rate. Accuracy varied by prompt with a maximum of 71.8% and a minimum of 46.7%. CONCLUSIONS: This study indicates that ChatGPT version 4 is insufficient to triage simulated disaster patients via the START protocol. It demonstrated suboptimal repeatability and reproducibility. The overall accuracy of triage was only 63.9%. Health care professionals are advised to exercise caution while using commercial LLMs for vital medical determinations, given that these tools may commonly produce inaccurate data, colloquially referred to as hallucinations or confabulations. Artificial intelligence-guided tools should undergo rigorous statistical evaluation-using methods such as gage R and R-before implementation into clinical settings.",Franc JM; Hertelendy AJ; Cheng L; Hata R; Verde M 38055323,Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study.,2023,JMIR medical education,,,,"BACKGROUND: The reliability of GPT-4, a state-of-the-art expansive language model specializing in clinical reasoning and medical knowledge, remains largely unverified across non-English languages. OBJECTIVE: This study aims to compare fundamental clinical competencies between Japanese residents and GPT-4 by using the General Medicine In-Training Examination (GM-ITE). 
METHODS: We used the GPT-4 model provided by OpenAI and the GM-ITE examination questions for the years 2020, 2021, and 2022 to conduct a comparative analysis. This analysis focused on evaluating the performance of individuals who were concluding their second year of residency in comparison to that of GPT-4. Given the current abilities of GPT-4, our study included only single-choice exam questions, excluding those involving audio, video, or image data. The assessment included 4 categories: general theory (professionalism and medical interviewing), symptomatology and clinical reasoning, physical examinations and clinical procedures, and specific diseases. Additionally, we categorized the questions into 7 specialty fields and 3 levels of difficulty, which were determined based on residents' correct response rates. RESULTS: Upon examination of 137 GM-ITE questions in Japanese, GPT-4 scores were significantly higher than the mean scores of residents (residents: 55.8%, GPT-4: 70.1%; P<.001). In terms of specific disciplines, GPT-4 scored 23.5 points higher in the ""specific diseases,"" 30.9 points higher in ""obstetrics and gynecology,"" and 26.1 points higher in ""internal medicine."" In contrast, GPT-4 scores in ""medical interviewing and professionalism,"" ""general practice,"" and ""psychiatry"" were lower than those of the residents, although this discrepancy was not statistically significant. Upon analyzing scores based on question difficulty, GPT-4 scores were 17.2 points lower for easy problems (P=.007) but were 25.4 and 24.4 points higher for normal and difficult problems, respectively (P<.001). In year-on-year comparisons, GPT-4 scores were 21.7 and 21.5 points higher in the 2020 (P=.01) and 2022 (P=.003) examinations, respectively, but only 3.5 points higher in the 2021 examinations (no significant difference). CONCLUSIONS: In the Japanese language, GPT-4 also outperformed the average medical residents in the GM-ITE test, originally designed for them. 
Specifically, GPT-4 demonstrated a tendency to score higher on difficult questions with low resident correct response rates and those demanding a more comprehensive understanding of diseases. However, GPT-4 scored comparatively lower on questions that residents could readily answer, such as those testing attitudes toward patients and professionalism, as well as those necessitating an understanding of context and communication. These findings highlight the strengths and limitations of artificial intelligence applications in medical education and practice.",Watari T; Takagi S; Sakaguchi K; Nishizaki Y; Shimizu T; Yamamoto Y; Tokuda Y 38295466,Assessing the precision of artificial intelligence in ED triage decisions: Insights from a study with ChatGPT.,2024,The American journal of emergency medicine,,,,"BACKGROUND: The rise in emergency department presentations globally poses challenges for efficient patient management. To address this, various strategies aim to expedite patient management. Artificial intelligence's (AI) consistent performance and rapid data interpretation extend its healthcare applications, especially in emergencies. The introduction of a robust AI tool like ChatGPT, based on GPT-4 developed by OpenAI, can benefit patients and healthcare professionals by improving the speed and accuracy of resource allocation. This study examines ChatGPT's capability to predict triage outcomes based on local emergency department rules. METHODS: This study is a single-center prospective observational study. The study population consists of all patients who presented to the emergency department with any symptoms and agreed to participate. The study was conducted on three non-consecutive days for a total of 72 h. Patients' chief complaints, vital parameters, medical history and the area to which they were directed by the triage team in the emergency department were recorded. 
Concurrently, an emergency medicine physician inputted the same data into previously trained GPT-4, according to local rules. According to this data, the triage decisions made by GPT-4 were recorded. In the same process, an emergency medicine specialist determined where the patient should be directed based on the data collected, and this decision was considered the gold standard. Accuracy rates and reliability for directing patients to specific areas by the triage team and GPT-4 were evaluated using Cohen's kappa test. Furthermore, the accuracy of the patient triage process performed by the triage team and GPT-4 was assessed by receiver operating characteristic (ROC) analysis. Statistical analysis considered a value of p < 0.05 as significant. RESULTS: The study was carried out on 758 patients. Among the participants, 416 (54.9%) were male and 342 (45.1%) were female. Evaluating the primary endpoints of our study - the agreement between the decisions of the triage team, GPT-4 decisions in emergency department triage, and the gold standard - we observed almost perfect agreement both between the triage team and the gold standard and between GPT-4 and the gold standard (Cohen's Kappa 0.893 and 0.899, respectively; p < 0.001 for each). CONCLUSION: Our findings suggest GPT-4 possesses outstanding predictive skills in triaging patients in an emergency setting. GPT-4 can serve as an effective tool to support the triage process.",Pasli S; Sahin AS; Beser MF; Topcuoglu H; Yadigaroglu M; Imamoglu M 38472675,Enhanced Artificial Intelligence Strategies in Renal Oncology: Iterative Optimization and Comparative Analysis of GPT 3.5 Versus 4.0.,2024,Annals of surgical oncology,,,,"BACKGROUND: The rise of artificial intelligence (AI) in medicine has revealed the potential of ChatGPT as a pivotal tool in medical diagnosis and treatment. This study assesses the efficacy of ChatGPT versions 3.5 and 4.0 in addressing renal cell carcinoma (RCC) clinical inquiries. 
Notably, fine-tuning and iterative optimization of the model corrected ChatGPT's limitations in this area. METHODS: In our study, 80 RCC-related clinical questions from urology experts were posed three times to both ChatGPT 3.5 and ChatGPT 4.0, seeking binary (yes/no) responses. We then statistically analyzed the answers. Finally, we fine-tuned the GPT-3.5 Turbo model using these questions, and assessed its training outcomes. RESULTS: We found that the average accuracy rates of answers provided by ChatGPT versions 3.5 and 4.0 were 67.08% and 77.50%, respectively. ChatGPT 4.0 outperformed ChatGPT 3.5, with a higher accuracy rate in responses (p < 0.05). By counting the number of correct responses to the 80 questions, we then found that although ChatGPT 4.0 performed better (p < 0.05), both versions were subject to instability in answering. Finally, by fine-tuning the GPT-3.5 Turbo model, we found that the correct rate of responses to these questions could be stabilized at 93.75%. Iterative optimization of the model can result in 100% response accuracy. CONCLUSION: We compared ChatGPT versions 3.5 and 4.0 in addressing clinical RCC questions, identifying their limitations. By applying the GPT-3.5 Turbo fine-tuned model iterative training method, we enhanced AI strategies in renal oncology. This approach is set to enhance ChatGPT's database and clinical guidance capabilities, optimizing AI in this field.",Liang R; Zhao A; Peng L; Xu X; Zhong J; Wu F; Yi F; Zhang S; Wu S; Hou J 39610054,Impact of artificial intelligence in managing musculoskeletal pathologies in physiatry: a qualitative observational study evaluating the potential use of ChatGPT versus Copilot for patient information and clinical advice on low back pain.,2025,Journal of Yeungnam medical science,,,,"BACKGROUND: The self-management of low back pain (LBP) through patient information interventions offers significant benefits in terms of cost, reduced work absenteeism, and overall healthcare utilization. 
Using a large language model (LLM), such as ChatGPT (OpenAI) or Copilot (Microsoft), could potentially enhance these outcomes further. Thus, it is important to evaluate the LLMs ChatGPT and Copilot in providing medical advice for LBP and to assess the impact of clinical context on the quality of responses. METHODS: This was a qualitative comparative observational study. It was conducted within the Department of Physical Medicine and Rehabilitation, University of Montreal in Montreal, QC, Canada. ChatGPT and Copilot were used to answer 27 common questions related to LBP, with and without a specific clinical context. The responses were evaluated by physiatrists for validity, safety, and usefulness using a 4-point Likert scale (4, most favorable). RESULTS: Both ChatGPT and Copilot demonstrated good performance across all measures. Validity scores were 3.33 for ChatGPT and 3.18 for Copilot, safety scores were 3.19 for ChatGPT and 3.13 for Copilot, and usefulness scores were 3.60 for ChatGPT and 3.57 for Copilot. The inclusion of clinical context did not significantly change the results. CONCLUSION: LLMs, such as ChatGPT and Copilot, can provide reliable medical advice on LBP, irrespective of the detailed clinical context, supporting their potential to aid in patient self-management.",Ah-Yan C; Boissonnault E; Boudier-Reveret M; Mares C 38180782,Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.,2024,JMIR medical education,,,,"BACKGROUND: The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student's knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. 
However, comprehensive studies analyzing ChatGPT's performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited. OBJECTIVE: This paper aimed to analyze ChatGPT's performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating. METHODS: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions was entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions. RESULTS: Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A noteworthy correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (rho=-0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with rho=-0.289 for ChatGPT 3.5 and rho=-0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached. CONCLUSIONS: In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). 
Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics.",Knoedler L; Alfertshofer M; Knoedler S; Hoch CC; Funk PF; Cotofana S; Maheta B; Frank K; Brebant V; Prantl L; Lamby P 38823007,Enhancing readability of USFDA patient communications through large language models: a proof-of-concept study.,2024,Expert review of clinical pharmacology,,,,"BACKGROUND: The US Food and Drug Administration (USFDA) communicates new drug safety concerns through drug safety communications (DSCs) and medication guides (MGs), which often challenge patients with average reading abilities due to their complexity. This study assesses whether large language models (LLMs) can enhance the readability of these materials. METHODS: We analyzed the latest DSCs and MGs, using ChatGPT 4.0(c) and Gemini(c) to simplify them to a sixth-grade reading level. Outputs were evaluated for readability, technical accuracy, and content inclusiveness. RESULTS: Original materials were difficult to read (DSCs grade level 13, MGs 22). LLMs significantly improved readability, reducing the grade levels to more accessible readings (Single prompt - DSCs: ChatGPT 4.0(c) 10.1, Gemini(c) 8; MGs: ChatGPT 4.0(c) 7.1, Gemini(c) 6.5. Multiple prompts - DSCs: ChatGPT 4.0(c) 10.3, Gemini(c) 7.5; MGs: ChatGPT 4.0(c) 8, Gemini(c) 6.8). LLM outputs retained technical accuracy and key messages. CONCLUSION: LLMs can significantly simplify complex health-related information, making it more accessible to patients. 
Future research should extend these findings to other languages and patient groups in real-world settings.",Sridharan K; Sivaramakrishnan G 40093710,Advancing personalized medicine in digital health: The role of artificial intelligence in enhancing clinical interpretation of 24-h ambulatory blood pressure monitoring.,2025,Digital health,,,,"Background: The use of artificial intelligence (AI) for interpreting ambulatory blood pressure monitoring (ABPM) data is gaining traction in clinical practice. Evaluating the accuracy of AI models, like ChatGPT 4.0, in clinical settings can inform their integration into healthcare processes. However, limited research has been conducted to validate the performance of such models against expert interpretations in real-world clinical scenarios. Methods: A total of 53 ABPM records from Mayo Clinic, Minnesota, were analyzed. ChatGPT 4.0's interpretations were compared with consensus results from two experienced nephrologists, based on the American College of Cardiology/American Heart Association (ACC/AHA) guidelines. The study assessed ChatGPT's accuracy and reliability over two rounds of testing, with a three-month interval between rounds. Results: ChatGPT achieved an accuracy of 87% for identifying hypertension, 89% for nocturnal hypertension, 81% for nocturnal dipping, and 94% for abnormal heart rate. ChatGPT correctly identified all conditions in 60% of ABPM records. The percentage agreement between the first and second round of ChatGPT's analysis was 81% in identifying hypertension, 85% in nocturnal hypertension, 89% in nocturnal dipping, and 94% in abnormal heart rate. There was no significant difference in accuracy between the first and second round (all p > 0.05). The Kappa statistic was 0.63 for identifying hypertension, 0.66 for nocturnal hypertension, 0.76 for nocturnal dipping, and 0.70 for abnormal heart rate. 
Conclusions: ChatGPT 4.0 demonstrates potential as a reliable tool for interpreting 24-h ABPM data, achieving substantial agreement with expert nephrologists. These findings underscore the potential for AI integration into hypertension management workflows, while highlighting the need for further validation in larger, diverse cohorts.",Alam SF; Thongprayoon C; Miao J; Pham JH; Sheikh MS; Garcia Valencia OA; Schwartz GL; Craici IM; Gonzalez Suarez ML; Cheungpasitporn W 38180787,"Artificial Intelligence in Medicine: Cross-Sectional Study Among Medical Students on Application, Education, and Ethical Aspects.",2024,JMIR medical education,,,,"BACKGROUND: The use of artificial intelligence (AI) in medicine not only directly impacts the medical profession but is also increasingly associated with various potential ethical aspects. In addition, the expanding use of AI and AI-based applications such as ChatGPT demands a corresponding shift in medical education to adequately prepare future practitioners for the effective use of these tools and address the associated ethical challenges they present. OBJECTIVE: This study aims to explore how medical students from Germany, Austria, and Switzerland perceive the use of AI in medicine and the teaching of AI and AI ethics in medical education in accordance with their use of AI-based chat applications, such as ChatGPT. METHODS: This cross-sectional study, conducted from June 15 to July 15, 2023, surveyed medical students across Germany, Austria, and Switzerland using a web-based survey. This study aimed to assess students' perceptions of AI in medicine and the integration of AI and AI ethics into medical education. The survey, which included 53 items across 6 sections, was developed and pretested. Data analysis used descriptive statistics (median, mode, IQR, total number, and percentages) and either the chi-square or Mann-Whitney U tests, as appropriate. 
RESULTS: Surveying 487 medical students across Germany, Austria, and Switzerland revealed limited formal education on AI or AI ethics within medical curricula, although 38.8% (189/487) had prior experience with AI-based chat applications, such as ChatGPT. Despite varied prior exposures, 71.7% (349/487) anticipated a positive impact of AI on medicine. There was widespread consensus (385/487, 74.9%) on the need for AI and AI ethics instruction in medical education, although the current offerings were deemed inadequate. Regarding the AI ethics education content, all proposed topics were rated as highly relevant. CONCLUSIONS: This study revealed a pronounced discrepancy between the use of AI-based (chat) applications, such as ChatGPT, among medical students in Germany, Austria, and Switzerland and the teaching of AI in medical education. To adequately prepare future medical professionals, there is an urgent need to integrate the teaching of AI and AI ethics into the medical curricula.",Weidener L; Fischer M 38049066,Effectiveness of ChatGPT in clinical pharmacy and the role of artificial intelligence in medication therapy management.,2024,Journal of the American Pharmacists Association : JAPhA,,,,"BACKGROUND: The use of artificial intelligence (AI) to optimize medication therapy management (MTM) in identifying drug interactions may potentially improve MTM efficiency. ChatGPT, an AI language model, may be applied to identify medication interventions by integrating patient and drug databases. ChatGPT has been shown to be effective in other areas of clinical medicine, from diagnosis to patient management. However, ChatGPT's ability to manage MTM related activities is little known. OBJECTIVES: To evaluate the effectiveness of ChatGPT in MTM services in simple, complex, and very complex cases to understand AI contributions in MTM. METHODS: Two clinical pharmacists rated and validated the difficulty of patient cases from simple, complex, and very complex. 
ChatGPT's response to the cases was assessed based on 3 criteria: the ability to identify drug interactions, precision in recommending alternatives, and appropriateness in devising management plans. Two clinical pharmacists validated the accuracy of ChatGPT's responses and compared them to actual answers for each complexity level. RESULTS: ChatGPT 4.0 accurately solved 39 out of 39 (100 %) patient cases. ChatGPT successfully identified drug interactions, provided therapy recommendations and formulated general management plans, but it did not recommend specific dosages. Results suggest it can assist pharmacists in formulating MTM plans to improve overall efficiency. CONCLUSION: The application of ChatGPT in MTM has the potential to enhance patient safety and involvement, lower healthcare costs, and assist healthcare providers in medication management and identifying drug interactions. Future pharmacists can utilize AI models such as ChatGPT to improve patient care. The future of the pharmacy profession will depend on how the field responds to the changing need for patient care optimized by AI and automation.",Roosan D; Padua P; Khan R; Khan H; Verzosa C; Wu Y 38789962,Evaluating the accuracy of Chat Generative Pre-trained Transformer version 4 (ChatGPT-4) responses to United States Food and Drug Administration (FDA) frequently asked questions about dental amalgam.,2024,BMC oral health,,,,"BACKGROUND: The use of artificial intelligence in the field of health sciences is becoming widespread. It is known that patients benefit from artificial intelligence applications on various health issues, especially after the pandemic period. One of the most important issues in this regard is the accuracy of the information provided by artificial intelligence applications. 
OBJECTIVE: The purpose of this study was to direct the frequently asked questions about dental amalgam, as determined by the United States Food and Drug Administration (FDA), which is one of these information resources, to Chat Generative Pre-trained Transformer version 4 (ChatGPT-4) and to compare the content of the answers given by the application with the answers of the FDA. METHODS: The questions were directed to ChatGPT-4 on May 8th and May 16th, 2023, and the responses were recorded and compared at the word and meaning levels using ChatGPT. The answers from the FDA webpage were also recorded. The responses were compared for content similarity in ""Main Idea"", ""Quality Analysis"", ""Common Ideas"", and ""Inconsistent Ideas"" between ChatGPT-4's responses and FDA's responses. RESULTS: ChatGPT-4 provided similar responses at one-week intervals. In comparison with FDA guidance, it provided answers with similar information content to frequently asked questions. However, although there were some similarities in the general aspects of the recommendation regarding amalgam removal in the question, the two texts are not the same, and they offered different perspectives on the replacement of fillings. CONCLUSIONS: The findings of this study indicate that ChatGPT-4, an artificial intelligence based application, encompasses current and accurate information regarding dental amalgam and its removal, providing it to individuals seeking access to such information. Nevertheless, we believe that numerous studies are required to assess the validity and reliability of ChatGPT-4 across diverse subjects.",Buldur M; Sezer B 40235859,Evaluating the accuracy of ChatGPT in delivering patient instructions for medications: an exploratory case study.,2025,Frontiers in artificial intelligence,,,,"BACKGROUND: The use of ChatGPT in healthcare is still in its early stages; however, it has the potential to become a cornerstone in modern healthcare systems. 
This study aims to assess the accuracy of ChatGPT's output compared with that of CareNotes® in providing patient instructions for three medications: tirzepatide, citalopram, and apixaban. METHODS: An exploratory case study was conducted using a published questionnaire to evaluate ChatGPT-generated reports against patient instructions from CareNotes®. The evaluation focused on the completeness and correctness of the reports, as well as their potential to cause harm or lead to poor medication adherence. The evaluation was conducted by four pharmacy experts and 33 PharmD interns. RESULTS: The evaluators indicated that the ChatGPT reports of tirzepatide, citalopram, and apixaban were correct but lacked completeness. Additionally, ChatGPT reports have the potential to cause harm and may negatively affect medication adherence. CONCLUSION: Although ChatGPT demonstrated promising results, particularly in terms of correctness, it cannot yet be considered a reliable standalone source of patient drug information.",Abanmy NO; Al-Ghreimil N; Alsabhan JF; Al-Baity H; Aljadeed R 40023616,Can ChatGPT-4 perform as a competent physician based on the Chinese critical care examination?,2025,Journal of critical care,,,,"BACKGROUND: The use of ChatGPT in medical applications is of increasing interest. However, its efficacy in critical care medicine remains uncertain. This study aims to assess ChatGPT-4's performance in critical care examination, providing insights into its potential as a tool for clinical decision-making. METHODS: A dataset from the Chinese Health Professional Technical Qualification Examination for Critical Care Medicine, covering four components (fundamental knowledge, specialized knowledge, professional practical skills, and related medical knowledge) was utilized. ChatGPT-4 answered 600 questions, which were evaluated by critical care experts using a standardized rubric.
RESULTS: ChatGPT-4 achieved a 73.5 % success rate, surpassing the 60 % passing threshold in all four components, with the highest accuracy in fundamental knowledge (81.94 %). ChatGPT-4 performed significantly better on single-choice questions than on multiple-choice questions (76.72 % vs. 51.32 %, p < 0.001), while no significant difference was observed between case-based and non-case-based questions. CONCLUSION: ChatGPT demonstrated notable strengths in critical care examination, highlighting its potential for supporting clinical decision-making, information retrieval, and medical education. However, caution is required regarding its potential to generate inaccurate responses. Its application in critical care must therefore be carefully supervised by medical professionals to ensure both the accuracy of the information and patient safety.
The development of [artificial intelligence] is as fundamental as the creation of the microprocessor, the personal computer, the Internet, and the mobile phone. It will change the way people work, learn, travel, get health care, and communicate with each other. (Bill Gates)",Borkowski AA; Jakey CE; Mastorides SM; Kraus AL; Vidyarthi G; Viswanadhan N; Lezama JL 38691128,[ChatGPT for use in technology-enhanced learning in anesthesiology and emergency medicine and potential clinical application of AI language models: Between hype and reality around artificial intelligence in medical use].,2024,Die Anaesthesiologie,,,,"BACKGROUND: The utilization of AI language models in education and academia is currently a subject of research, and applications in clinical settings are also being tested. Studies conducted by various research groups have demonstrated that language models can answer questions related to medical board examinations, and there are potential applications of these models in medical education as well. RESEARCH QUESTION: This study aims to investigate the extent to which current version language models prove effective for addressing medical inquiries, their potential utility in medical education, and the challenges that still exist in the functioning of AI language models. METHOD: The program ChatGPT, based on GPT-3.5, had to answer 1025 questions from the second part (M2) of the medical board examination. The study examined whether errors occurred and, if so, what types of errors. Additionally, the language model was asked to generate essays on the learning objectives outlined in the standard curriculum for specialist training in anesthesiology and the supplementary qualification in emergency medicine. These essays were analyzed afterwards and checked for errors and anomalies. RESULTS: The findings indicated that ChatGPT was able to correctly answer the questions with an accuracy rate exceeding 69%, even when the questions included references to visual aids.
This represented an improvement in the accuracy of answering board examination questions compared to a study conducted in March; however, when it came to generating essays, a high error rate was observed. DISCUSSION: Considering the current pace of ongoing improvements in AI language models, widespread clinical implementation, especially in emergency departments as well as emergency and intensive care medicine with the assistance of medical trainees, is a plausible scenario. These models can provide insights to support medical professionals in their work, without relying solely on the language model. Although the use of these models in education holds promise, it currently requires a significant amount of supervision. Due to hallucinations caused by inadequate training environments for the language model, the generated texts might deviate from the current state of scientific knowledge. Direct deployment in patient care settings without permanent physician supervision does not yet appear achievable.",Humbsch P; Horn E; Bohm K; Gintrowicz R 40432855,Performance of Artificial Intelligence in Addressing Questions Regarding the Management of Pediatric Supracondylar Humerus Fractures.,2025,Journal of the Pediatric Orthopaedic Society of North America,,,,"BACKGROUND: The vast accessibility of artificial intelligence (AI) has enabled its utilization in medicine to improve patient education, augment patient-physician communications, support research efforts, and enhance medical student education. However, there is significant concern that these models may provide responses that are incorrect, biased, or lacking in the required nuance and complexity of best practice clinical decision-making. Currently, there is a paucity of literature comparing the quality and reliability of AI-generated responses.
The purpose of this study was to assess the ability of ChatGPT and Gemini to generate responses to the 2022 American Academy of Orthopaedic Surgeons' (AAOS) current practice guidelines on pediatric supracondylar humerus fractures. We hypothesized that both ChatGPT and Gemini would demonstrate high-quality, evidence-based responses with no significant difference between the models across evaluation criteria. METHODS: The responses from ChatGPT and Gemini to prompts based on the 14 AAOS guidelines were evaluated by seven fellowship-trained pediatric orthopaedic surgeons using a questionnaire to assess five key characteristics on a scale from 1 to 5. The prompts were categorized into nonoperative or preoperative management and diagnosis, surgical timing and technique, and rehabilitation and prevention. Statistical analysis included mean scoring, standard deviation, and two-sided t-tests to compare the performance between ChatGPT and Gemini. Scores were then evaluated for inter-rater reliability. RESULTS: ChatGPT and Gemini demonstrated consistent performance across the criteria, with high mean scores across all criteria except for evidence-based responses. Mean scores were highest for clarity (ChatGPT: 3.745 +/- 0.237, Gemini: 4.388 +/- 0.154) and lowest for evidence-based responses (ChatGPT: 1.816 +/- 0.181, Gemini: 3.765 +/- 0.229). There were notable statistically significant differences across all criteria, with Gemini having higher mean scores in each criterion (P < .001). Gemini achieved statistically higher ratings in the relevance (P = .03) and evidence-based (P < .001) criteria. Both large language models (LLMs) performed comparably in the accuracy, clarity, and completeness criteria (P > .05). CONCLUSIONS: ChatGPT and Gemini produced responses aligned with the 2022 AAOS current guideline practices for pediatric supracondylar humerus fractures.
Gemini outperformed ChatGPT across all criteria, with the greatest difference in scores seen in the evidence-based category. This study emphasizes the potential for LLMs, particularly Gemini, to provide pertinent clinical information for managing pediatric supracondylar humerus fractures. KEY CONCEPTS: (1) The accessibility of artificial intelligence has enabled its utilization in medicine to improve patient education, support research efforts, enhance medical student education, and augment patient-physician communications. (2) There is a significant concern that artificial intelligence may provide responses that are incorrect, biased, or lacking in the required nuance and complexity of best practice clinical decision-making. (3) There is a paucity of literature comparing the quality and reliability of AI-generated responses regarding management of pediatric supracondylar humerus fractures. (4) In our study, both ChatGPT and Gemini produced responses that were well aligned with the AAOS current guideline practices for pediatric supracondylar humerus fractures; however, Gemini outperformed ChatGPT across all criteria, with the greatest difference in scores seen in the evidence-based category. LEVEL OF EVIDENCE: Level II.
No studies have evaluated the accuracy and precision of LLMs in responding to questions related to the field of physical medicine and rehabilitation (PM&R). OBJECTIVE: To determine the accuracy and precision of two OpenAI LLMs (GPT-3.5, released in November 2022, and GPT-4o, released in May 2024) in answering questions related to PM&R knowledge. DESIGN: Cross-sectional study. Both LLMs were tested on the same 744 PM&R knowledge questions that covered all aspects of the field (general rehabilitation, stroke, traumatic brain injury, spinal cord injury, musculoskeletal medicine, pain medicine, electrodiagnostic medicine, pediatric rehabilitation, prosthetics and orthotics, rheumatology, and pharmacology). Each LLM was tested three times on the same question set to assess for precision. SETTING: N/A. PATIENTS: N/A. INTERVENTIONS: N/A. MAIN OUTCOME MEASURE: Percentage of correctly answered questions. RESULTS: For three runs of the 744-question set, GPT-3.5 answered 56.3%, 56.5%, and 56.9% of the questions correctly. For three runs of the same question set, GPT-4o answered 83.6%, 84%, and 84.1% of the questions correctly. GPT-4o outperformed GPT-3.5 in all subcategories of PM&R questions. CONCLUSIONS: LLM technology is rapidly advancing, with the more recent GPT-4o model performing much better on PM&R knowledge questions compared to GPT-3.5. There is potential for LLMs in augmenting clinical practice, medical training, and patient education. However, the technology has limitations and physicians should remain cautious in using it in practice at this time.",Bitterman J; D'Angelo A; Holachek A; Eubanks JE 39378442,Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study.,2024,JMIR medical education,,,,"BACKGROUND: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. 
Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. OBJECTIVE: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. METHODS: In this study, ChatGPT-4 was embedded in a specialized subenvironment, ""AI Family Medicine Board Exam Taker,"" designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI's ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. RESULTS: In our study, ChatGPT-4's performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4's capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. 
CONCLUSIONS: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI.",Goodings AJ; Kajitani S; Chhor A; Albakri A; Pastrak M; Kodancha M; Ives R; Lee YB; Kajitani K 39238293,"ChatGPT in Clinical Medicine, Urology and Academia: A Review.",2024,Archivos espanoles de urologia,,,,"BACKGROUND: This study aims to provide a comprehensive overview of the current literature on the utilisation of ChatGPT in the fields of clinical medicine, urology, and academic medicine, while also addressing the associated ethical challenges and potential risks. METHODS: This narrative review conducted an extensive search of the PubMed and MEDLINE databases, covering the period from January 2022 to January 2024. The search phrases employed were ""urologic surgery"" in conjunction with ""artificial intelligence"", ""machine learning"", ""neural network"", ""ChatGPT"", ""urology"", and ""medicine"". The initial studies were chosen from the screened research to examine the possible interaction between those entities. Research utilising animal models was excluded. RESULTS: ChatGPT has demonstrated its usefulness in clinical settings by producing precise clinical correspondence, discharge summaries, and medical records, thereby assisting in these laborious tasks, especially with the latest iterations of ChatGPT. 
Furthermore, patients can access essential medical information by inquiring with ChatGPT. Nevertheless, there are multiple concerns regarding the correctness of the system, including allegations of falsified data and references. These issues emphasise the importance of having a doctor oversee the final result to guarantee patient safety. ChatGPT shows potential in academic medicine for generating drafts and organising datasets. However, the presence of guidelines and plagiarism-detection technologies is necessary to mitigate the risks of plagiarism and the use of faked data when using it for academic purposes. CONCLUSIONS: ChatGPT should be utilised as a supplementary tool by urologists and academicians. However, it is now advisable to have human oversight to guarantee patient safety, uphold academic integrity, and maintain transparency.",Tzelves L; Kapriniotis K; Feretzakis G; Katsimperis S; Manolitsis I; Juliebo-Jones P; Pietropaolo A; Tonyali S; Bellos T; Somani B 38928668,Google Bard and ChatGPT in Orthopedics: Which Is the Better Doctor in Sports Medicine and Pediatric Orthopedics? The Role of AI in Patient Education.,2024,"Diagnostics (Basel, Switzerland)",,,,"BACKGROUND: This study evaluates the potential of ChatGPT and Google Bard as educational tools for patients in orthopedics, focusing on sports medicine and pediatric orthopedics. The aim is to compare the quality of responses provided by these natural language processing (NLP) models, addressing concerns about the potential dissemination of incorrect medical information. METHODS: Ten ACL- and flat foot-related questions from a Google search were presented to ChatGPT-3.5 and Google Bard. Expert orthopedic surgeons rated the responses using the Global Quality Score (GQS). The study minimized bias by clearing chat history before each question, maintaining respondent anonymity and employing statistical analysis to compare response quality. 
RESULTS: ChatGPT-3.5 and Google Bard yielded good-quality responses, with average scores of 4.1 +/- 0.7 and 4 +/- 0.78, respectively, for sports medicine. For pediatric orthopedics, Google Bard scored 3.5 +/- 1, while the average score for responses generated by ChatGPT was 3.8 +/- 0.83. In both cases, no statistically significant difference was found between the platforms (p = 0.6787, p = 0.3092). Despite ChatGPT's responses being considered more readable, both platforms showed promise for AI-driven patient education, with no reported misinformation. CONCLUSIONS: ChatGPT and Google Bard demonstrate significant potential as supplementary patient education resources in orthopedics. However, improvements are needed for increased reliability. The study underscores the evolving role of AI in orthopedics and calls for continued research to ensure a conscientious integration of AI in healthcare education.",Giorgino R; Alessandri-Bonetti M; Del Re M; Verdoni F; Peretti GM; Mangiavini L 38638404,Evaluating ChatGPT's efficacy in assessing the safety of non-prescription medications and supplements in patients with kidney disease.,2024,Digital health,,,,"BACKGROUND: This study investigated the efficacy of ChatGPT-3.5 and ChatGPT-4 in assessing drug safety for patients with kidney diseases, comparing their performance to Micromedex, a well-established drug information source. Despite the perception of non-prescription medications and supplements as safe, risks exist, especially for those with kidney issues. The study's goal was to evaluate ChatGPT's versions for their potential in clinical decision-making regarding kidney disease patients. METHOD: The research involved analyzing 124 common non-prescription medications and supplements using ChatGPT-3.5 and ChatGPT-4 with queries about their safety for people with kidney disease. 
The AI responses were categorized as ""generally safe,"" ""potentially harmful,"" or ""unknown toxicity."" Simultaneously, these medications and supplements were assessed in Micromedex using similar categories, allowing for a comparison of the concordance between the two resources. RESULTS: Micromedex identified 85 (68.5%) medications as generally safe, 35 (28.2%) as potentially harmful, and 4 (3.2%) of unknown toxicity. ChatGPT-3.5 identified 89 (71.8%) as generally safe, 11 (8.9%) as potentially harmful, and 24 (19.3%) of unknown toxicity. GPT-4 identified 82 (66.1%) as generally safe, 29 (23.4%) as potentially harmful, and 13 (10.5%) of unknown toxicity. The overall agreement between Micromedex and ChatGPT-3.5 was 64.5% and ChatGPT-4 demonstrated a higher agreement at 81.4%. Notably, ChatGPT-3.5's suboptimal performance was primarily influenced by a lower concordance rate among supplements, standing at 60.3%. This discrepancy could be attributed to the limited data on supplements within ChatGPT-3.5, with supplements constituting 80% of medications identified as unknown. CONCLUSION: ChatGPT's capabilities in evaluating the safety of non-prescription drugs and supplements for kidney disease patients are modest compared to established drug information resources. Neither ChatGPT-3.5 nor ChatGPT-4 can be currently recommended as reliable drug information sources for this demographic. 
The results highlight the need for further improvements in the model's accuracy and reliability in the medical domain.",Sheikh MS; Barreto EF; Miao J; Thongprayoon C; Gregoire JR; Dreesman B; Erickson SB; Craici IM; Cheungpasitporn W 40083047,"While GPT-3.5 is unable to pass the Physician Licensing Exam in Taiwan, GPT-4 successfully meets the criteria.",2025,Journal of the Chinese Medical Association : JCMA,,,,"BACKGROUND: This study investigates the performance of ChatGPT-3.5 and ChatGPT-4 in answering medical questions from Taiwan's Physician Licensing Exam, ranging from basic medical knowledge to specialized clinical topics. It aims to understand these artificial intelligence (AI) models' capabilities in a non-English context, specifically traditional Chinese. METHODS: The study incorporated questions from the Taiwan Physician Licensing Exam in 2022, excluding image-based queries. Each question was manually input into ChatGPT, and responses were compared with official answers from Taiwan's Ministry of Examination. Differences across specialties and question types were assessed using the Kruskal-Wallis and Fisher's exact tests. RESULTS: ChatGPT-3.5 achieved an average accuracy of 67.7% in basic medical sciences and 53.2% in clinical medicine. Meanwhile, ChatGPT-4 significantly outperformed ChatGPT-3.5, with average accuracies of 91.9% and 90.7%, respectively. ChatGPT-3.5 scored above 60.0% in seven out of 10 basic medical science subjects and three of 14 clinical subjects, while ChatGPT-4 scored above 60.0% in every subject. The type of question did not significantly affect accuracy rates. CONCLUSION: ChatGPT-3.5 showed proficiency in basic medical sciences but was less reliable in clinical medicine, whereas ChatGPT-4 demonstrated strong capabilities in both areas. However, their proficiency varied across different specialties. The type of question had minimal impact on performance. 
This study highlights the potential of AI models in medical education and non-English language examinations and the need for cautious and informed implementation in educational settings due to variability across specialties.",Chen TA; Lin KC; Lin MH; Chang HT; Chen YC; Chen TJ 40273640,ChatGPT-supported patient triage with voice commands in the emergency department: A prospective multicenter study.,2025,The American journal of emergency medicine,,,,"BACKGROUND: Triage aims to prioritize patients according to their medical urgency by accurately evaluating their clinical conditions, managing waiting times efficiently, and improving the overall effectiveness of emergency care. This study aims to assess ChatGPT's performance in patient triage across four emergency departments with varying dynamics and to provide a detailed analysis of its strengths and weaknesses. METHODS: In this multicenter, prospective study, we compared the triage decisions made by ChatGPT-4o and the triage personnel with the gold standard decisions determined by an emergency medicine (EM) specialist. In the hospitals where we conducted the study, triage teams routinely direct patients to the appropriate ED areas based on the Emergency Severity Index (ESI) system and the hospital's local triage protocols. During the study period, the triage team collected patient data, including chief complaints, comorbidities, and vital signs, and used this information to make the initial triage decisions. An independent physician simultaneously entered the same data into ChatGPT using voice commands. At the same time, an EM specialist, present in the triage room throughout the study period, reviewed the same patient data and determined the gold standard triage decisions, strictly adhering to both the hospital's local protocols and the ESI system.
Before initiating the study, we customized ChatGPT for each hospital by designing prompts that incorporated both the general principles of the ESI triage system and the specific triage rules of each hospital. The model's overall, hospital-based, and area-based performance was evaluated, with Cohen's kappa, F1 score, and performance analyses conducted. RESULTS: This study included 6657 patients. The overall agreement between triage personnel and GPT-4o with the gold standard was nearly perfect (Cohen's kappa = 0.782 and 0.833, respectively). The overall F1 score was 0.863 for the triage team, while GPT-4o achieved an F1 score of 0.897, demonstrating superior performance. ROC curve analysis showed the lowest performance in the yellow zone of a tertiary hospital (AUC = 0.75) and in the red zone of another tertiary hospital (AUC = 0.78). However, overall, AUC values greater than 0.90 were observed, indicating high accuracy. CONCLUSION: ChatGPT generally outperformed triage personnel in patient triage across emergency departments with varying conditions, demonstrating high agreement with the gold standard decision. However, in tertiary hospitals, its performance was relatively lower in triaging patients with more complex symptoms, particularly those requiring triage to the yellow and red zones.
As such, script concordance tests (SCTs) can be used to promote a higher level of clinical reasoning. Recent technological developments have made generative artificial intelligence (AI)-based systems such as ChatGPT (OpenAI) available to assist clinician-educators in creating instructional materials. OBJECTIVE: The main objective of this project is to explore how SCTs generated by ChatGPT compare with SCTs produced by clinical experts on 3 major elements: the scenario (stem), clinical questions, and expert opinion. METHODS: This mixed methods study evaluated 3 ChatGPT-generated SCTs with 3 expert-created SCTs using a predefined framework. Clinician-educators as well as resident doctors in psychiatry involved in undergraduate medical education in Quebec, Canada, evaluated via a web-based survey the 6 SCTs on 3 criteria: the scenario, clinical questions, and expert opinion. They were also asked to describe the strengths and weaknesses of the SCTs. RESULTS: A total of 102 respondents assessed the SCTs. There were no significant distinctions between the 2 types of SCTs concerning the scenario (P=.84), clinical questions (P=.99), and expert opinion (P=.07), as interpreted by the respondents. Indeed, respondents struggled to differentiate between ChatGPT- and expert-generated SCTs. ChatGPT showcased promise in expediting SCT design, aligning well with Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition criteria, albeit with a tendency toward caricatured scenarios and simplistic content. CONCLUSIONS: This study is the first to concentrate on the design of SCTs supported by AI in a period where medicine is changing swiftly and where technologies generated from AI are expanding much faster.
This study suggests that ChatGPT can be a valuable tool in creating educational materials, and further validation is essential to ensure educational efficacy and accuracy.",Hudon A; Kiepura B; Pelletier M; Phan V 38328046,A Comprehensive Evaluation of Large Language Models in Mining Gene Interactions and Pathway Knowledge.,2024,bioRxiv : the preprint server for biology,,,,"BACKGROUND: Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways is useful but cannot keep up with the exponential growth of the literature. Large-scale language models (LLMs), notable for their vast parameter sizes and comprehensive training on extensive text corpora, have great potential in automated text mining of biological pathways. METHOD: This study assesses the effectiveness of 21 LLMs, including both API-based models and open-source models. The evaluation focused on two key aspects: gene regulatory relations (specifically, 'activation', 'inhibition', and 'phosphorylation') and KEGG pathway component recognition. The performance of these models was analyzed using statistical metrics such as precision, recall, F1 scores, and the Jaccard similarity index. RESULTS: Our results indicated a significant disparity in model performance. Among the API-based models, ChatGPT-4 and Claude-Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for the gene regulatory relation prediction, and Jaccard similarity indices of 0.2778 and 0.2657 for the KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts, where Falcon-180b-chat and llama1-7b led with the highest performance in gene regulatory relations (F1 of 0.2787 and 0.1923, respectively) and KEGG pathway recognition (Jaccard similarity index of 0.2237 and 0.2207, respectively).
CONCLUSION: LLMs are valuable in biomedical research, especially in gene network analysis and pathway mapping. However, their effectiveness varies, necessitating careful model selection. This work also provided a case study and insight into using LLMs as knowledge graphs.",Azam M; Chen Y; Arowolo MO; Liu H; Popescu M; Xu D 40274037,Leveraging natural language processing to elucidate real-world clinical decision-making paradigms: A proof of concept study.,2025,Journal of biomedical informatics,,,,"BACKGROUND: Understanding how clinicians arrive at decisions in actual practice settings is vital for advancing personalized, evidence-based care. However, systematic analysis of qualitative decision data poses challenges. METHODS: We analyzed transcribed interviews with Hebrew-speaking clinicians on decision processes using natural language processing (NLP). Word frequency analysis characterized terminology use, while large language models (ChatGPT from OpenAI and Gemini by Google) identified potential cognitive paradigms. RESULTS: Word frequency analysis of clinician interviews identified experience and knowledge as most influential on decision-making. NLP tentatively recognized heuristics-based reasoning grounded in past cases and intuition as dominant cognitive paradigms. Elements of shared decision-making through individualizing care with patients and families were also observed. Limited Hebrew clinical language resources required developing preliminary lexicons and dynamically adjusting stopwords. Findings also provided preliminary support for heuristics guiding clinical judgment while highlighting needs for broader sampling and enhanced analytical frameworks. CONCLUSIONS: This study represents the first use of integrated qualitative and computational methods to systematically elucidate clinical decision-making. Findings supported experience-based heuristics guiding cognition.
With methodological enhancements, similar analyses could transform global understanding of tailored care delivery. Standardizing interdisciplinary collaborations on developing NLP tools and analytical frameworks may advance equitable, evidence-based healthcare by elucidating real-world clinical reasoning processes across diverse populations and settings.",Alon Y; Naimi E; Levin C; Videl H; Saban M 37830257,Reshaping medical education: Performance of ChatGPT on a PES medical examination.,2024,Cardiology journal,,,,"BACKGROUND: We are currently experiencing a third digital revolution driven by artificial intelligence (AI), and the emergence of new chat generative pre-trained transformer (ChatGPT) represents a significant technological advancement with profound implications for global society, especially in the field of education. METHODS: The aim of this study was to see how well ChatGPT performed on medical school exams and to highlight how it might change medical education and practice. Recently, OpenAI's ChatGPT (OpenAI, San Francisco; GPT-4 May 24 Version) was put to the test against a significant Polish medical specialization licensing exam (PES), and the results are in. The version of ChatGPT-4 used in this study was the most up-to-date model at the time of publication (GPT-4). ChatGPT answered questions from June 28, 2023, to June 30, 2023. RESULTS: ChatGPT demonstrates notable advancements in natural language processing models on the tasks of medical question answering. In June 2023, the performance of ChatGPT was assessed based on its ability to answer a set of 120 questions, where it achieved a correct response rate of 67.1%, accurately responding to 80 questions. CONCLUSIONS: ChatGPT may be used as an assistance tool in medical education. 
While ChatGPT can serve as a valuable tool in medical education, it cannot fully replace human expertise and knowledge due to its inherent limitations.",Wojcik S; Rulkiewicz A; Pruszczyk P; Lisik W; Pobozy M; Domienik-Karlowicz J 38564282,"A Comparative Study of Large Language Models, Human Experts, and Expert-Edited Large Language Models to Neuro-Ophthalmology Questions.",2025,Journal of neuro-ophthalmology : the official journal of the North American Neuro-Ophthalmology Society,,,,"BACKGROUND: While large language models (LLMs) are increasingly used in medicine, their effectiveness compared with human experts remains unclear. This study evaluates the quality and empathy of Expert + AI, human experts, and LLM responses in neuro-ophthalmology. METHODS: This randomized, masked, multicenter cross-sectional study was conducted from June to July 2023. We randomly assigned 21 neuro-ophthalmology questions to 13 experts. Each expert provided an answer and then edited a ChatGPT-4-generated response, timing both tasks. In addition, 5 LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, Bard) generated responses. Anonymized and randomized responses from Expert + AI, human experts, and LLMs were evaluated by the remaining 12 experts. The main outcome was the mean score for quality and empathy, rated on a 1-5 scale. RESULTS: Significant differences existed between response types for both quality and empathy ( P < 0.0001, P < 0.0001). For quality, Expert + AI (4.16 +/- 0.81) performed the best, followed by GPT-4 (4.04 +/- 0.92), GPT-3.5 (3.99 +/- 0.87), Claude (3.6 +/- 1.09), Expert (3.56 +/- 1.01), Bard (3.5 +/- 1.15), and Bing (3.04 +/- 1.12). For empathy, Expert + AI (3.63 +/- 0.87) had the highest score, followed by GPT-4 (3.6 +/- 0.88), Bard (3.54 +/- 0.89), GPT-3.5 (3.5 +/- 0.83), Bing (3.27 +/- 1.03), Expert (3.26 +/- 1.08), and Claude (3.11 +/- 0.78). For quality ( P < 0.0001) and empathy ( P = 0.002), Expert + AI performed better than Expert. 
Time taken for expert-created and expert-edited LLM responses was similar ( P = 0.75). CONCLUSIONS: Expert-edited LLM responses had the highest expert-determined ratings of quality and empathy warranting further exploration of their potential benefits in clinical settings.",Tailor PD; Dalvin LA; Starr MR; Tajfirouz DA; Chodnicki KD; Brodsky MC; Mansukhani SA; Moss HE; Lai KE; Ko MW; Mackay DD; Di Nome MA; Dumitrascu OM; Pless ML; Eggenberger ER; Chen JJ 39910559,AI-based medical ethics education: examining the potential of large language models as a tool for virtue cultivation.,2025,BMC medical education,,,,"BACKGROUND: With artificial intelligence (AI) increasingly revolutionising medicine, this study critically evaluates the integration of large language models (LLMs), known for advanced text processing and generation capabilities, in medical ethics education, focusing on promoting virtue. Positing LLMs as central to mimicking nuanced human communication, it examines their use in medical education and the ethicality of embedding AI in such contexts. METHOD: Using a hybrid approach that combines principlist and non-principlist methodologies, we position LLMs as exemplars and advisors. RESULTS: We discuss the imperative for including AI ethics in medical curricula and its utility as an educational tool, identify the lack of educational resources in medical ethics education, and advocate for future LLMs to mitigate this problem as a ""second-best"" tool. We also emphasise the critical importance of instilling virtue in medical ethics education and illustrate how LLMs can effectively impart moral knowledge and model virtue cultivation. We address expected counter-arguments to using LLMs in this area and explain their profound potential to enrich medical ethics education, including facilitating the acquisition of moral knowledge and developing ethically grounded practitioners. 
CONCLUSIONS: The study involved a comprehensive exploration of the function of LLMs in medical ethics education, positing that tools such as ChatGPT can profoundly enhance the learning experience in the future. This is achieved through tailored, interactive educational encounters while addressing the ethical nuances of their use in educational settings.",Okamoto S; Kataoka M; Itano M; Sawai T 37750374,GPTZero Performance in Identifying Artificial Intelligence-Generated Medical Texts: A Preliminary Study.,2023,Journal of Korean medical science,,,,"BACKGROUND: With emergence of chatbots to help authors with scientific writings, editors should have tools to identify artificial intelligence-generated texts. GPTZero is among the first websites that has sought media attention claiming to differentiate machine-generated from human-written texts. METHODS: Using 20 text pieces generated by ChatGPT in response to arbitrary questions on various topics in medicine and 30 pieces chosen from previously published medical articles, the performance of GPTZero was assessed. RESULTS: GPTZero had a sensitivity of 0.65 (95% confidence interval, 0.41-0.85); specificity, 0.90 (0.73-0.98); accuracy, 0.80 (0.66-0.90); and positive and negative likelihood ratios, 6.5 (2.1-19.9) and 0.4 (0.2-0.7), respectively. CONCLUSION: GPTZero has a low false-positive (classifying a human-written text as machine-generated) and a high false-negative rate (classifying a machine-generated text as human-written).",Habibzadeh F 38970829,Comparative outcomes of AI-assisted ChatGPT and face-to-face consultations in infertility patients: a cross-sectional study.,2024,Postgraduate medical journal,,,,"BACKGROUND: With the advent of artificial intelligence (AI) in healthcare, digital platforms like ChatGPT offer innovative alternatives to traditional medical consultations. 
This study seeks to understand the comparative outcomes of AI-assisted ChatGPT consultations and conventional face-to-face interactions among infertility patients. METHODS: A cross-sectional study was conducted involving 120 infertility patients, split evenly between those consulting via ChatGPT and traditional face-to-face methods. The primary outcomes assessed were patient satisfaction, understanding, and consultation duration. Secondary outcomes included demographic information, clinical history, and subsequent actions post-consultation. RESULTS: While both consultation methods had a median age of 34 years, patients using ChatGPT reported significantly higher satisfaction levels (median 4 out of 5) compared to face-to-face consultations (median 3 out of 5; p < 0.001). The ChatGPT group also experienced shorter consultation durations, with a median difference of 12.5 minutes (p < 0.001). However, understanding, demographic distributions, and subsequent actions post-consultation were comparable between the two groups. CONCLUSIONS: AI-assisted ChatGPT consultations offer a promising alternative to traditional face-to-face consultations in assisted reproductive medicine. While patient satisfaction was higher and consultation durations were shorter with ChatGPT, further studies are required to understand the long-term implications and clinical outcomes associated with AI-driven medical consultations. Key Messages What is already known on this topic: Artificial intelligence (AI) applications, such as ChatGPT, have shown potential in various healthcare settings, including primary care and mental health support. Infertility is a significant global health issue that requires extensive consultations, often facing challenges such as long waiting times and varied patient satisfaction. Previous studies suggest that AI can offer personalized care and immediate feedback, but its efficacy compared with traditional consultations in reproductive medicine was not well-studied. 
What this study adds: This study demonstrates that AI-assisted ChatGPT consultations result in significantly higher patient satisfaction and shorter consultation durations compared with traditional face-to-face consultations among infertility patients. Both consultation methods were comparable in terms of patient understanding, demographic distributions, and subsequent actions postconsultation. How this study might affect research, practice, or policy: The findings suggest that AI-driven consultations could serve as an effective and efficient alternative to traditional methods, potentially reducing consultation times and improving patient satisfaction in reproductive medicine. Further research could explore the long-term impacts and broader applications of AI in clinical settings, influencing future healthcare practices and policies toward integrating AI technologies.",Cheng S; Xiao Y; Liu L; Sun X 39933171,Smart Pharmaceutical Monitoring System With Personalized Medication Schedules and Self-Management Programs for Patients With Diabetes: Development and Evaluation Study.,2025,Journal of medical Internet research,,,,"BACKGROUND: With the climbing incidence of type 2 diabetes, the health care system is under pressure to manage patients with this condition properly. Particularly, pharmacological therapy constitutes the most fundamental means of controlling blood glucose levels and preventing the progression of complications. However, its effectiveness is often hindered by factors such as treatment complexity, polypharmacy, and poor patient adherence. As new technologies, artificial intelligence and digital technologies are covering all aspects of the medical and health care field, but their application and evaluation in the domain of diabetes research remain limited. 
OBJECTIVE: This study aims to develop and establish a stand-alone diabetes management service system designed to enhance self-management support for patients, as well as to assess its performance with experienced health care professionals. METHODS: The Diabetes Universal Medication Schedule (DUMS) system is grounded in official medicine instructions and evidence-based data to establish medication constraints and drug-drug interaction profiles. Individualized medication schedules and self-management programs were generated based on patient-specific conditions and needs, using an app framework to build patient-side contact pathways. The system's ability to provide medication guidance and health management was assessed by senior health care professionals using a 5-point Likert scale across 3 groups: outputs generated by the system (DUMS group), outputs refined by pharmacists (intervention group), and outputs generated by ChatGPT-4 (GPT-4 group). RESULTS: We constructed a cloud-based drug information management system loaded with 475 diabetes treatment-related medications; 684 medication constraints; and 12,351 drug-drug interactions and theoretical supports. The generated personalized medication plan and self-management program included recommended dosing times, disease education, dietary considerations, and lifestyle recommendations to help patients with diabetes achieve correct medication use and active disease management. Reliability analysis demonstrated that the DUMS group outperformed the GPT-4 group in medication schedule accuracy and safety, as well as comprehensiveness and richness of the self-management program (P<.001). The intervention group outperformed the DUMS and GPT-4 groups on all indicator scores. CONCLUSIONS: DUMS's treatment monitoring service can provide reliable self-management support for patients with diabetes. 
ChatGPT-4, powered by artificial intelligence, can act as a collaborative assistant to health care professionals in clinical contexts, although its performance still requires further training and optimization.",Xiao J; Li M; Cai R; Huang H; Yu H; Huang L; Li J; Yu T; Zhang J; Cheng S 37115245,[Artificial intelligence: How will ChatGPT and other AI applications change our everyday medical practice?].,2023,"Medizinische Klinik, Intensivmedizin und Notfallmedizin",,,,"BACKGROUND: With the free provision of the chat robot ""ChatGPT"" by the company OpenAI in November 2022, an application of artificial intelligence (AI) became tangible for everyone. OBJECTIVES: An explanation of the basic functionality of large language models (LLM) is given, followed by a presentation of application options of ChatGPT in medicine, and an outlook and discussion of possible dangers of AI applications. METHODS: Problem solving with ChatGPT using concrete examples. Analysis and discussion of the available scientific literature. RESULTS: There has been a significant increase in the use of AI applications in scientific work, especially in scientific writing. Wide application of LLM in writing medical documentation is conceivable. Technical functionality allows the use of AI applications as a diagnostic support system. There is a risk of spreading and entrenching inaccuracies and bias through application of LLM. Regulation of this new technology is pending. CONCLUSION: AI applications such as ChatGPT have the potential to permanently change everyday medical practice. 
An examination of this technology and evaluation of opportunities and risks is warranted.",Sonntagbauer M; Haar M; Kluge S 37770637,Exploring the Potential of ChatGPT-4 in Responding to Common Questions About Abdominoplasty: An AI-Based Case Study of a Plastic Surgery Consultation.,2024,Aesthetic plastic surgery,,,,"BACKGROUND: With the increasing integration of artificial intelligence (AI) in health care, AI chatbots like ChatGPT-4 are being used to deliver health information. OBJECTIVES: This study aimed to assess the capability of ChatGPT-4 in answering common questions related to abdominoplasty, evaluating its potential as an adjunctive tool in patient education and preoperative consultation. METHODS: A variety of common questions about abdominoplasty were submitted to ChatGPT-4. These questions were sourced from a question list provided by the American Society of Plastic Surgery to ensure their relevance and comprehensiveness. An experienced plastic surgeon meticulously evaluated the responses generated by ChatGPT-4 in terms of informational depth, response articulation, and competency to determine the proficiency of the AI in providing patient-centered information. RESULTS: The study showed that ChatGPT-4 can give clear answers, making it useful for answering common queries. However, it struggled with personalized advice and sometimes provided incorrect or outdated references. Overall, ChatGPT-4 can effectively share abdominoplasty information, which may help patients better understand the procedure. Despite these positive findings, the AI needs more refinement, especially in providing personalized and accurate information, to fully meet patient education needs in plastic surgery. CONCLUSIONS: Although ChatGPT-4 shows promise as a resource for patient education, continuous improvements and rigorous checks are essential for its beneficial integration into healthcare settings. 
The study emphasizes the need for further research, particularly focused on improving the personalization and accuracy of AI responses. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Li W; Chen J; Chen F; Liang J; Yu H 37615692,[Large language models such as ChatGPT and GPT-4 for patient-centered care in radiology].,2023,"Radiologie (Heidelberg, Germany)",,,,"BACKGROUND: With the introduction of ChatGPT in late November 2022, large language models based on artificial intelligence have gained worldwide recognition. These language models are trained on vast amounts of data, enabling them to process complex tasks in seconds and provide detailed, high-level text-based responses. OBJECTIVE: To provide an overview of the most widely discussed large language models, ChatGPT and GPT‑4, with a focus on potential applications for patient-centered radiology. MATERIALS AND METHODS: A PubMed search of both large language models was performed using the terms ""ChatGPT"" and ""GPT-4"", with subjective selection and completion in the form of a narrative review. RESULTS: The generic nature of language models holds great promise for radiology, enabling both patients and referrers to facilitate understanding of radiological findings, overcome language barriers, and improve the quality of informed consent discussions. This could represent a significant step towards patient-centered or person-centered radiology. CONCLUSION: Large language models represent a promising tool for improving the communication of findings, interdisciplinary collaboration, and workflow in radiology. 
However, important privacy issues and the reliable applicability of these models in medicine remain to be addressed.",Fink MA 39622707,Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain.,2024,JMIR medical education,,,,"BACKGROUND: With the rapid advancement of artificial intelligence (AI) in various fields, evaluating its application in specialized medical contexts becomes crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine. OBJECTIVE: This study aims to compare the performance of ChatGPT with that of attending neurologists in a real neurology specialist examination conducted in the Valencian Community, Spain, assessing the AI's capabilities and limitations in medical knowledge. METHODS: We conducted a comparative analysis using the 2022 neurology specialist examination results from 120 neurologists and responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions, with a focus on clinical neurology and health legislation. Questions were classified according to Bloom's Taxonomy. Statistical analysis of performance, including the kappa coefficient for response consistency, was performed. RESULTS: Human participants exhibited a median score of 5.91 (IQR: 4.93-6.76), with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of correct answers (score 7.57), surpassing several human specialists. No significant variations were observed in the performance on lower-order questions versus higher-order questions. Additionally, ChatGPT-4 demonstrated increased interrater reliability, as reflected by a higher kappa coefficient of 0.73, compared to ChatGPT-3.5's coefficient of 0.69. 
CONCLUSIONS: This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4's performance, outperforming the median score of human participants in a rigorous neurology examination, represents a significant milestone in AI development, suggesting its potential as an effective tool in specialized medical education and assessment.",Ros-Arlanzon P; Perez-Sempere A 39868292,"CARDBiomedBench: A Benchmark for Evaluating Large Language Model Performance in Biomedical Research: A novel question-and-answer benchmark designed to assess Large Language Models' comprehension of biomedical research, piloted on Neurodegenerative Diseases.",2025,bioRxiv : the preprint server for biology,,,,"BACKGROUNDS: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain. METHODS: We present CARDBiomedBench, a novel question-and-answer benchmark for evaluating LLMs in biomedical research. For our pilot implementation, we focus on neurodegenerative diseases (NDDs), a domain requiring integration of genetic, molecular, and clinical knowledge. The benchmark combines expert-annotated question-answer (Q/A) pairs with semi-automated data augmentation, drawing from authoritative public resources including drug development data, genome-wide association studies (GWAS), and Summary-data based Mendelian Randomization (SMR) analyses. We evaluated seven private and open-source LLMs across ten biological categories and nine reasoning skills, using novel metrics to assess both response quality and safety. RESULTS: Our benchmark comprises over 68,000 Q/A pairs, enabling robust evaluation of LLM performance. 
Current state-of-the-art models show significant limitations: models like Claude-3.5-Sonnet demonstrate excessive caution (Response Quality Rate: 25% [95% CI: 25% +/- 1], Safety Rate: 76% +/- 1), while others like ChatGPT-4o exhibit both poor accuracy and unsafe behavior (Response Quality Rate: 37% +/- 1, Safety Rate: 31% +/- 1). These findings reveal fundamental gaps in LLMs' ability to handle complex biomedical information. CONCLUSION: CARDBiomedBench establishes a rigorous standard for assessing LLM capabilities in biomedical research. Our pilot evaluation in the NDD domain reveals critical limitations in current models' ability to safely and accurately process complex scientific information. Future iterations will expand to other biomedical domains, supporting the development of more reliable AI systems for accelerating scientific discovery.",Bianchi O; Willey M; Alvarado CX; Danek B; Khani M; Kuznetsov N; Dadu A; Shah S; Koretsky MJ; Makarious MB; Weller C; Levine KS; Kim S; Jarreau P; Vitale D; Marsan E; Iwaki H; Leonard H; Bandres-Ciga S; Singleton AB; Nalls MA; Mokhtari S; Khashabi D; Faghri F 39289734,"Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis.",2024,Journal of orthopaedic surgery and research,,,,"BACKGROUNDS: The use of large language models (LLMs) in medicine can help physicians improve the quality and effectiveness of health care by increasing the efficiency of medical information management, patient care, medical research, and clinical decision-making. METHODS: We collected 34 frequently asked questions about glucocorticoid-induced osteoporosis (GIOP), covering topics related to the disease's clinical manifestations, pathogenesis, diagnosis, treatment, prevention, and risk factors. 
We also generated 25 questions based on the 2022 American College of Rheumatology Guideline for the Prevention and Treatment of Glucocorticoid-Induced Osteoporosis (2022 ACR-GIOP Guideline). Each question was posed to each LLM (ChatGPT-3.5, ChatGPT-4, and Google Gemini), and three senior orthopedic surgeons independently rated the responses generated by the LLMs on a scale of 1 to 4 points. A total score (TS) > 9 indicated 'good' responses, 6 15 years of experience (86%). The majority of surgeons had no prior experience with ChatGPT in surgical practice (86%). For material discussing both acute cholecystitis and upper gastrointestinal hemorrhage, evidence-based sources were rated as significantly more comprehensive (3.57 (+/-.535) vs 2.00 (+/-1.16), P = .025) (4.14 (+/-.69) vs 2.43 (+/-.98), P < .001) and valid (3.71 (+/-.488) vs 2.86 (+/-1.07), P = .045) (3.71 (+/-.76) vs 2.71 (+/-.95) P = .038) than ChatGPT. However, there was no significant difference in accuracy between the two sources (3.71 vs 3.29, P = .289) (3.57 vs 2.71, P = .111). CONCLUSION: Surveyed U.S. board-certified practicing surgeons rated evidence-based sources as significantly more comprehensive and valid compared to ChatGPT across the majority of surveyed surgical conditions. However, there was no significant difference in accuracy between the sources across the majority of surveyed conditions. While ChatGPT may offer potential benefits in surgical practice, further refinement and validation are necessary to enhance its utility and acceptance among surgeons.",Nasef H; Patel H; Amin Q; Baum S; Ratnasekera A; Ang D; Havron WS; Nakayama D; Elkbuli A 37115365,Exploring the Potential of GPT-4 in Biomedical Engineering: The Dawn of a New Era.,2023,Annals of biomedical engineering,,,,"Biomedical engineering is a relatively young interdisciplinary field based on engineering, biology, and medicine. 
Of note, the rapid progress of artificial intelligence (AI)-based technologies has made a significant impact on the biomedical engineering field, and continuously brings innovations and breakthroughs. Recently, ChatGPT, an AI chatbot developed by the company OpenAI, has gained tremendous attention due to its powerful natural language generation and understanding ability. In this study, we explored the potential of GPT-4 in the eight branches of biomedical engineering including medical imaging, medical devices, bioinformatics, biomaterials, biomechanics, gene and cell engineering, tissue engineering, and neural engineering. Our results show that the application of GPT-4 will bring new opportunities for the development of this field.",Cheng K; Guo Q; He Y; Lu Y; Gu S; Wu H 38306900,PubMed and beyond: biomedical literature search in the age of artificial intelligence.,2024,EBioMedicine,,,,"Biomedical research yields vast information, much of which is only accessible through the literature. Consequently, literature search is crucial for healthcare and biomedicine. Recent improvements in artificial intelligence (AI) have expanded functionality beyond keywords, but they might be unfamiliar to clinicians and researchers. In response, we present an overview of over 30 literature search tools tailored to common biomedical use cases, aiming at helping readers efficiently fulfill their information needs. We first discuss recent improvements and continued challenges of the widely used PubMed. Then, we describe AI-based literature search tools catering to five specific information needs: 1. Evidence-based medicine. 2. Precision medicine and genomics. 3. Searching by meaning, including questions. 4. Finding related articles with literature recommendation. 5. Discovering hidden associations through literature mining. 
Finally, we discuss the impacts of recent developments of large language models such as ChatGPT on biomedical information seeking.",Jin Q; Leaman R; Lu Z 39518681,"Harnessing the Power of ChatGPT in Cardiovascular Medicine: Innovations, Challenges, and Future Directions.",2024,Journal of clinical medicine,,,,"Cardiovascular diseases remain the leading cause of morbidity and mortality globally, posing significant challenges to public health. The rapid evolution of artificial intelligence (AI), particularly with large language models such as ChatGPT, has introduced transformative possibilities in cardiovascular medicine. This review examines ChatGPT's broad applications in enhancing clinical decision-making-covering symptom analysis, risk assessment, and differential diagnosis; advancing medical education for both healthcare professionals and patients; and supporting research and academic communication. Key challenges associated with ChatGPT, including potential inaccuracies, ethical considerations, data privacy concerns, and inherent biases, are discussed. Future directions emphasize improving training data quality, developing specialized models, refining AI technology, and establishing regulatory frameworks to enhance ChatGPT's clinical utility and mitigate associated risks. As cardiovascular medicine embraces AI, ChatGPT stands out as a powerful tool with substantial potential to improve therapeutic outcomes, elevate care quality, and advance research innovation. Fully understanding and harnessing this potential is essential for the future of cardiovascular health.",Leon M; Ruaengsri C; Pelletier G; Bethencourt D; Shibata M; Flores MQ; Shudo Y 37973369,Assessing the performance of ChatGPT in bioethics: a large language model's moral compass in medicine.,2024,Journal of medical ethics,,,,"Chat Generative Pre-Trained Transformer (ChatGPT) has been a growing point of interest in medical education yet has not been assessed in the field of bioethics. 
This study evaluated the accuracy of ChatGPT-3.5 (April 2023 version) in answering text-based, multiple choice bioethics questions at the level of US third-year and fourth-year medical students. A total of 114 bioethical questions were identified from the widely utilised question banks UWorld and AMBOSS. Accuracy, bioethical categories, difficulty levels, specialty data, error analysis and character count were analysed. We found that ChatGPT had an accuracy of 59.6%, with greater accuracy in topics surrounding death and patient-physician relationships and performed poorly on questions pertaining to informed consent. Of all the specialties, it performed best in paediatrics. Yet, certain specialties and bioethical categories were under-represented. Among the errors made, it tended towards content errors and application errors. There were no significant associations between character count and accuracy. Nevertheless, this investigation contributes to the ongoing dialogue on artificial intelligence's (AI) role in healthcare and medical education, advocating for further research to fully understand AI systems' capabilities and constraints in the nuanced field of medical bioethics.",Chen J; Cadiente A; Kasselman LJ; Pilkington B 39306288,Evaluating Performance of ChatGPT on MKSAP Cardiology Board Review Questions.,2024,International journal of cardiology,,,,"Chat Generative Pretrained Transformer (ChatGPT) is a natural language processing tool created by OpenAI. Much of the discussion regarding artificial intelligence (AI) in medicine is the ability of the language to enhance medical practice, improve efficiency and decrease errors. 
The objective of this study was to analyze the ability of ChatGPT to answer board-style cardiovascular medicine questions by using the Medical Knowledge Self-Assessment Program (MKSAP).The study evaluated the performance of ChatGPT (versions 3.5 and 4), alongside internal medicine residents and internal medicine and cardiology attendings, in answering 98 multiple-choice questions (MCQs) from the Cardiovascular Medicine Chapter of MKSAP. ChatGPT-4 demonstrated an accuracy of 74.5 %, comparable to internal medicine (IM) intern (63.3 %), senior resident (63.3 %), internal medicine attending physician (62.2 %), and ChatGPT-3.5 (64.3 %) but significantly lower than cardiology attending physician (85.7 %). Subcategory analysis revealed no statistical difference between ChatGPT and physicians, except in valvular heart disease where cardiology attending outperformed ChatGPT (p = 0.031) for version 3.5, and for heart failure (p = 0.046) where ChatGPT-4 outperformed senior resident. While ChatGPT shows promise in certain subcategories, in order to establish AI as a reliable educational tool for medical professionals, performance of ChatGPT will likely need to surpass the accuracy of instructors, ideally achieving the near-perfect score on posed questions.",Milutinovic S; Petrovic M; Begosh-Mayne D; Lopez-Mattei J; Chazal RA; Wood MJ; Escarcega RO 37295794,Addition of dexamethasone to prolong peripheral nerve blocks: a ChatGPT-created narrative review.,2024,Regional anesthesia and pain medicine,,,,"Chat Generative Pre-trained Transformer (ChatGPT), an artificial intelligence chatbot, produces detailed responses and human-like coherent answers, and has been used in the clinical and academic medicine. To evaluate its accuracy in regional anesthesia topics, we produced a ChatGPT review on the addition of dexamethasone to prolong peripheral nerve blocks. 
A group of experts in regional anesthesia and pain medicine were invited to help shape the topic to be studied, refine the questions entered into the ChatGPT program, vet the manuscript for accuracy, and create a commentary on the article. Although ChatGPT produced an adequate summary of the topic for a general medical or lay audience, the review that was created appeared to be inadequate for a subspecialty audience such as the expert authors. Major concerns raised by the authors included the poor search methodology, poor organization/lack of flow, inaccuracies/omissions of text or references, and lack of novelty. At this time, we do not believe ChatGPT is able to replace human experts and is extremely limited in providing original, creative solutions/ideas and interpreting data for a subspecialty medical review article.",Wu CL; Cho B; Gabriel R; Hurley R; Liu J; Mariano ER; Mathur V; Memtsoudis SG; Grant MC 38156230,All aboard the ChatGPT steamroller: Top 10 ways to make artificial intelligence work for healthcare professionals.,2023,Antimicrobial stewardship & healthcare epidemiology : ASHE,,,,"Chat Generative Pre-trained Transformer (ChatGPT), the flagship generative artificial intelligence (AI) chatbot by OpenAI, is transforming many things in medicine, from healthcare and research to medical education. It is anticipated to integrate into many aspects of the medical industry, and we should brace for this inevitability and use it to our advantage. Here are proposed ways you can use ChatGPT in medicine with some specific use cases in antimicrobial stewardship and hospital epidemiology.",Non LR 39259341,Analysis of Responses of GPT-4 V to the Japanese National Clinical Engineer Licensing Examination.,2024,Journal of medical systems,,,,"Chat Generative Pretrained Transformer (ChatGPT; OpenAI) is a state-of-the-art large language model that can simulate human-like conversations based on user input.
We evaluated the performance of GPT-4 V in the Japanese National Clinical Engineer Licensing Examination using 2,155 questions from 2012 to 2023. The average correct answer rate for all questions was 86.0%. In particular, clinical medicine, basic medicine, medical materials, biological properties, and mechanical engineering achieved a correct response rate of ≥90%. Conversely, medical device safety management, electrical and electronic engineering, and extracorporeal circulation obtained low correct answer rates ranging from 64.8% to 76.5%. The correct answer rates for questions that included figures/tables, required numerical calculation, figure/table intersection calculation, and knowledge of Japanese Industrial Standards were 55.2%, 85.8%, 64.2%, and 31.0%, respectively. The reason for the low correct answer rates is that ChatGPT lacked recognition of the images and knowledge of standards and laws. This study concludes that careful attention is required when using ChatGPT because several of its explanations lack the correct description.",Ishida K; Arisaka N; Fujii K 37252576,A Case Study Demonstrating Applications of ChatGPT in the Clinical Management of Treatment-Resistant Schizophrenia.,2023,Cureus,,,,"Chat Generative Pre-trained Transformer, also known as ChatGPT, is a new artificial intelligence (AI) program that responds to user inquiry with discourse resembling human language. The range of ChatGPT capabilities caught the interest of the medical world after it demonstrated its ability to pass medical board examinations. In this case report, we present the clinical treatment of a 22-year-old male diagnosed with treatment-resistant schizophrenia (TRS) and compare the medical management suggested by ChatGPT to current standards of care in order to assess the program's ability to identify the disorder, evaluate potential medical and psychiatric work-up, and develop a treatment plan addressing the distinct nuances of our patient.
In our inquiry with ChatGPT, we found that it can accurately identify our patient as having TRS and order appropriate tests to methodically rule out alternative causes of acute psychosis. Furthermore, the AI program suggests pharmacologic treatment options including clozapine with adjuvant medications, and nonpharmacologic treatment options including electroconvulsive therapy (ECT), repetitive transcranial magnetic stimulation (rTMS), and psychotherapy which align with current standards of care. Lastly, ChatGPT provides a comprehensive list of side effects associated with antipsychotics and mood stabilizers used to treat TRS. We found both potential for and limitations in the clinical application of ChatGPT to assist in the assessment and management of complex medical conditions. Overall, ChatGPT may serve as a powerful tool to organize medical data in a meaningful and palatable format for medical professionals to reference during patient care.",Galido PV; Butala S; Chakerian M; Agustines D 37965416,Ectopic Pituitary Neuroendocrine Tumor: A Case Report Written With the Help of ChatGPT.,2023,Cureus,,,,"ChatGPT (Chat Generative Pre-trained Transformer) is a chat robot program developed by OpenAI (OpenAI Incorporated, Mission District, San Francisco, USA). It is a natural-language processing tool driven by artificial intelligence technology. It can complete tasks such as writing emails, video scripts, copywriting, translation, coding, and writing papers. At present, several studies have been conducted on the use of this tool for academic writing, so we report a rare case of suprasellar ectopic pituitary neuroendocrine tumor (PitNET) secreting growth hormones, in a female 53-year-old patient with facial changes and hypertrophy of her hands and feet for six years. When asking questions on unusual topics, ChatGPT can provide answers but not specific ones, and similar to most machine learning solutions, verifying the correctness of information is a problem. 
Question asked from ChatGPT: Summarize ectopic pituitary neuroendocrine tumor. ""Suprasellar heterotopic pituitary neuroendocrine tumors refer to tumors formed by the heterotopic growth of pituitary neuroendocrine tumor cells in the sellar region other than the pituitary glands. This tumor is common in middle-aged women, and it often manifests as headache, visual impairment, and endocrine dysfunction. Diagnosis is usually made by combining history, physical examination, blood tests, and imaging. Treatment is primarily performed through surgical resection of the tumor, radiotherapy, and drug therapy, and the prognosis is mostly good."" The content in quotation marks is generated by ChatGPT.",Li J; Zhong J; Li Z; Xiao Y; Wang S 37465809,When Precision Meets Penmanship: ChatGPT and Surgery Documentation.,2023,Cureus,,,,"ChatGPT (Chatbot Generative Pre-Trained Transformer) is an artificial intelligence with several potential applications in the field of medicine. As a large language model, it is particularly good at generating text. This study investigates the use of ChatGPT in constructing operation notes for laparoscopic appendicectomy, one of the most common surgical procedures in the UK. We prompted ChatGPT-4, the latest generation of ChatGPT, to produce operation notes for laparoscopic appendicectomy, which were then evaluated against 'Getting It Right First Time' (GIRFT) recommendations. GIRFT is an organisation that has collaborated with the National Health Service (NHS) to improve surgical documentation guidelines. Excluding certain items documented elsewhere in patient records, the generated notes were assessed against 30 key points in GIRFT recommendations. This process was repeated three times to obtain an average score. 
Our results showed that ChatGPT generated operation notes in seconds, with an average coverage of 78.8% (23.66 out of 30 points) of the GIRFT guidelines, surpassing average compliance with similar guidelines from the Royal College of Surgeons (RCS). However, the quality of ChatGPT's output was found to be dependent on the quality of the prompt, highlighting the need for verification of the generated content. Additionally, secure integration with electronic health records is required before ChatGPT can be adopted into the NHS.",Robinson A; Aggarwal S Jr 38281582,ChatGPT in maternal-fetal medicine practice: a primer for clinicians.,2024,American journal of obstetrics & gynecology MFM,,,,"ChatGPT (Generative Pre-trained Transformer), a language model that was developed by OpenAI and launched in November 2022, generates human-like responses to prompts using deep-learning technology. The integration of large language processing models into healthcare has the potential to improve the accessibility of medical information for both patients and health professionals alike. In this commentary, we demonstrated the ability of ChatGPT to produce patient information sheets. Four board-certified, maternal-fetal medicine attending physicians rated the accuracy and humanness of the information according to 2 predefined scales of accuracy and completeness. The median score for accuracy of information was rated 4.8 on a 6-point scale and the median score for completeness of information was 2.2 on a 3-point scale for the 5 patient information leaflets generated by ChatGPT. Concerns raised included the omission of clinically important information for patient counseling in some patient information leaflets and the inability to verify the source of information because ChatGPT does not provide references. 
ChatGPT is a powerful tool that has the potential to enhance patient care, but such a tool requires extensive validation and is perhaps best considered as an adjunct to clinical practice rather than as a tool to be used freely by the public for healthcare information.",Horgan R; Martins JG; Saade G; Abuhamad A; Kawakita T 37433672,A Conversation with ChatGPT.,2023,Journal of nuclear medicine technology,,,,ChatGPT chatbot powered by GPT 3.5 was released in late November 2022 but has been rapidly assimilated into educational and clinical environments. Method: Insight into ChatGPT capabilities was undertaken in an interview-style approach with the chatbot itself. Results: ChatGPT powered by GPT 3.5 exudes confidence in its capabilities in supporting and enhancing student learning in nuclear medicine and in supporting clinical practice. ChatGPT is also self-aware of limitations and flaws in capabilities and the risks these pose to academic integrity. Conclusion: Further objective evaluation of ChatGPT capabilities in authentic learning and clinical scenarios is required.,Currie G 39027317,Developing ChatGPT for biology and medicine: a complete review of biomedical question answering.,2024,Biophysics reports,,,,"ChatGPT explores a strategic blueprint of question answering (QA) to deliver medical diagnoses, treatment recommendations, and other healthcare support. This is achieved through the increasing incorporation of medical domain data via natural language processing (NLP) and multimodal paradigms. By transitioning the distribution of text, images, videos, and other modalities from the general domain to the medical domain, these techniques have accelerated the progress of medical domain question answering (MDQA). They bridge the gap between human natural language and sophisticated medical domain knowledge or expert-provided manual annotations, handling large-scale, diverse, unbalanced, or even unlabeled data analysis scenarios in medical contexts. 
Central to our focus is the utilization of language models and multimodal paradigms for medical question answering, aiming to guide the research community in selecting appropriate mechanisms for their specific medical research requirements. Specialized tasks such as unimodal-related question answering, reading comprehension, reasoning, diagnosis, relation extraction, probability modeling, and others, as well as multimodal-related tasks like vision question answering, image captioning, cross-modal retrieval, report summarization, and generation, are discussed in detail. Each section delves into the intricate specifics of the respective method under consideration. This paper highlights the structures and advancements of medical domain explorations against general domain methods, emphasizing their applications across different tasks and datasets. It also outlines current challenges and opportunities for future medical domain research, paving the way for continued innovation and application in this rapidly evolving field. This comprehensive review serves not only as an academic resource but also delineates the course for future probes and utilization in the field of medical question answering.",Li Q; Li L; Li Y 37061593,Behind the ChatGPT Hype: Are Its Suggestions Contributing to Addiction?,2023,Annals of biomedical engineering,,,,"ChatGPT has been a frequent topic of discussion lately. All over the Internet, from YouTube to blogs, there have been reports about how ChatGPT is able to plan people's daily activities, even for a whole month. However, what matters is what activities ChatGPT recommends. When ChatGPT was trained on a vast amount of data from the Internet, we wondered if it would suggest activities that can lead to addiction. In our test, not once did ChatGPT recommend an activity related to alcohol, drug use, or any other activity that can lead to addiction with serious health consequences. 
Suggestions seemed more like self-improvement posts on blogs than discussion forums where people might mention drinking in the evenings. Thus, if a person were to use ChatGPT as a personal lifestyle advisor, it does not appear on the basis of this test that ChatGPT would recommend activities that would be fundamentally detrimental to their health. However, more detailed long-term testing of similar tools is needed before recommendations for use in practice can be made.",Haman M; Skolnik M 39811862,Systematic review of ChatGPT accuracy and performance in Iran's medical licensing exams: A brief report.,2024,Journal of education and health promotion,,,,"ChatGPT has demonstrated significant potential in various aspects of medicine, including its performance on licensing examinations. In this study, we systematically investigated ChatGPT's performance in Iranian medical exams and assessed the quality of the included studies using a previously published assessment checklist. The study found that ChatGPT achieved an accuracy range of 32-72% on basic science exams, 34-68.5% on pre-internship exams, and 32-84% on residency exams. Notably, its performance was generally higher when the input was provided in English compared to Persian. One study reported a 40% accuracy rate on an endodontic board exam. To establish ChatGPT as a supplementary tool in medical education and clinical practice, we suggest that dedicated guidelines and checklists are needed to ensure high-quality and consistent research in this emerging field.",Keshtkar A; Atighi F; Reihani H 38866891,In-depth analysis of ChatGPT's performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions.,2024,Scientific reports,,,,"ChatGPT has garnered attention as a multifaceted AI chatbot with potential applications in medicine. 
Despite intriguing preliminary findings in areas such as clinical management and patient education, there remains a substantial knowledge gap in comprehensively understanding the opportunities and limitations of ChatGPT's capabilities, especially in medical test-taking and education. A total of n = 2,729 USMLE Step 1 practice questions were extracted from the Amboss question bank. After excluding 352 image-based questions, a total of 2,377 text-based questions were further categorized and entered manually into ChatGPT, and its responses were recorded. ChatGPT's overall performance was analyzed based on question difficulty, category, and content with regard to specific signal words and phrases. ChatGPT achieved an overall accuracy rate of 55.8% in a total number of n = 2,377 USMLE Step 1 preparation questions obtained from the Amboss online question bank. It demonstrated a significant inverse correlation between question difficulty and performance with r(s) = -0.306; p < 0.001, maintaining comparable accuracy to the human user peer group across different levels of question difficulty. Notably, ChatGPT outperformed in serology-related questions (61.1% vs. 53.8%; p = 0.005) but struggled with ECG-related content (42.9% vs. 55.6%; p = 0.021). ChatGPT achieved statistically significantly worse performance on pathophysiology-related question stems (signal phrase = ""what is the most likely/probable cause""). ChatGPT performed consistently across various question categories and difficulty levels.
These findings emphasize the need for further investigations to explore the potential and limitations of ChatGPT in medical examination and education.",Knoedler L; Knoedler S; Hoch CC; Prantl L; Frank K; Soiderer L; Cotofana S; Dorafshar AH; Schenck T; Vollbach F; Sofo G; Alfertshofer M 40395874,ChatGPT for mechanobiology and medicine: A perspective.,2023,Mechanobiology in medicine,,,,"ChatGPT has garnered significant attention for its impressive capabilities across various domains, including medicine and mechanobiology. In order to facilitate the integration of ChatGPT into research, this paper explores the applications of ChatGPT in these domains, focusing on its usage in (1) reading and writing, (2) retrieval and knowledge management, and (3) computation, simulation, and visualization. Meanwhile, this study acknowledges the limitations and challenges associated with ChatGPT's usage. We investigate the interaction between ChatGPT and external tools in these applications and advocate for the integration of more powerful tools in these research areas into ChatGPT to further expand its potential applications in medicine and mechanobiology.",Chen M; Li G 38029273,The use of ChatGPT in occupational medicine: opportunities and threats.,2023,Annals of occupational and environmental medicine,,,,"ChatGPT has the potential to revolutionize occupational medicine by providing a powerful tool for analyzing data, improving communication, and increasing efficiency. It can help identify patterns and trends in workplace health and safety, act as a virtual assistant for workers, employers, and occupational health professionals, and automate certain tasks. However, caution is required due to ethical concerns, the need to maintain confidentiality, and the risk of inconsistent or inaccurate results. 
ChatGPT cannot replace the crucial role of the occupational health professional in the medical surveillance of workers and the analysis of data on workers' health.",Sridi C; Brigui S 37142327,Generative artificial intelligence: Can ChatGPT write a quality abstract?,2023,Emergency medicine Australasia : EMA,,,,"ChatGPT is a generative artificial intelligence chatbot which may have a role in medicine and science. We investigated if the freely available version of ChatGPT can produce a quality conference abstract using a fictitious but accurately calculated data table as applied by a non-medically trained person. The resulting abstract was well written without obvious errors and followed the abstract instructions. One of the references was fictitious, known as 'hallucination'. ChatGPT or similar programmes, with careful review of the product by authors, may become a valuable scientific writing tool. The scientific and medical use of generative artificial intelligence, however, raises many questions.",Babl FE; Babl MP 37264670,Performance of ChatGPT on Specialty Certificate Examination in Dermatology multiple-choice questions.,2024,Clinical and experimental dermatology,,,,"ChatGPT is a large language model trained on increasingly large datasets by OpenAI to perform language-based tasks. It is capable of answering multiple-choice questions, such as those posed by the Specialty Certificate Examination (SCE) in Dermatology. We asked two iterations of ChatGPT: ChatGPT-3.5 and ChatGPT-4 84 multiple-choice sample questions from the sample SCE in Dermatology question bank. ChatGPT-3.5 achieved an overall score of 63%, and ChatGPT-4 scored 90% (a significant improvement in performance; P < 0.001). The typical pass mark for the SCE in Dermatology is 70-72%. ChatGPT-4 is therefore capable of answering clinical questions and achieving a passing grade in these sample questions. 
There are many possible educational and clinical implications for increasingly advanced artificial intelligence (AI) and its use in medicine, including in the diagnosis of dermatological conditions. Such advances should be embraced provided that patient safety is a core tenet, and the limitations of AI in the nuances of complex clinical cases are recognized.",Passby L; Jenko N; Wernham A 38405625,ChatGPT in forensic sciences: a new Pandora's box with advantages and challenges to pay attention.,2023,Forensic sciences research,,,,"ChatGPT is a variant of the generative pre-trained transformer (GPT) language model that uses large amounts of text-based training data and a transformer architecture to generate human-like text adjusted to the received prompts. ChatGPT presents several advantages in forensic sciences, namely, constituting a virtual assistant to aid lawyers, judges, and victims in managing and interpreting forensic expert data. But what would happen if ChatGPT began to be used to produce forensic expertise reports? Despite its potential applications, the use of ChatGPT and other Large Language Models and artificial intelligence tools in forensic writing also poses ethical and legal concerns, which are discussed in this perspective together with some expected future perspectives.",Dinis-Oliveira RJ; Azevedo RMS 38096831,"ChatGPT: opportunities and risks in the fields of medical care, teaching, and research.",2023,Gaceta medica de Mexico,,,,"ChatGPT is a virtual assistant with artificial intelligence (AI) that uses natural language to communicate, i.e., it holds conversations as those that would take place with another human being. It can be applied at all educational levels, including medical education, where it can impact medical training, research, the writing of scientific articles, clinical care, and personalized medicine. 
It can modify interactions between physicians and patients and thus improve the standards of healthcare quality and safety, for example, by suggesting preventive measures in a patient that sometimes are not considered by the physician for multiple reasons. ChatGPT potential uses in medical education, as a tool to support the writing of scientific articles, as a medical care assistant for patients and doctors for a more personalized medical approach, are some of the applications discussed in this article. Ethical aspects, originality, inappropriate or incorrect content, incorrect citations, cybersecurity, hallucinations, and plagiarism are some examples of situations to be considered when using AI-based tools in medicine.",Gutierrez-Cirlos C; Carrillo-Perez DL; Bermudez-Gonzalez JL; Hidrogo-Montemayor I; Carrillo-Esper R; Sanchez-Mendiola M 38358925,ChatGPT: How Closely Should We Be Watching?,2023,"Journal of insurance medicine (New York, N.Y.)",,,,ChatGPT is about to make major inroads into clinical medicine. This article discusses the pros and cons of its use.,Meagher T 37168166,ChatGPT for Future Medical and Dental Research.,2023,Cureus,,,,"ChatGPT is an artificial intelligence (AI) chatbot developed by OpenAI and it first became available to the public in November 2022. ChatGPT can assist in finding academic papers on the web and summarizing them. This chatbot has the potential to be applied in scientific writing, it has the ability to generate automated drafts, summarize articles, and translate content from several languages. This in turn can make academic writing faster and less challenging. However, due to ethical considerations, its use in scientific writing should be regulated and carefully monitored. Few papers have discussed the use of ChatGPT in scientific research writing. 
This review aims to discuss all the relevant published papers that discuss the use of ChatGPT in medical and dental research.",Fatani B 36981544,"ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns.",2023,"Healthcare (Basel, Switzerland)",,,,"ChatGPT is an artificial intelligence (AI)-based conversational large language model (LLM). The potential applications of LLMs in health care education, research, and practice could be promising if the associated valid concerns are proactively examined and addressed. The current systematic review aimed to investigate the utility of ChatGPT in health care education, research, and practice and to highlight its potential limitations. Using the PRISMA guidelines, a systematic search was conducted to retrieve English records in PubMed/MEDLINE and Google Scholar (published research or preprints) that examined ChatGPT in the context of health care education, research, or practice. A total of 60 records were eligible for inclusion. Benefits of ChatGPT were cited in 51/60 (85.0%) records and included: (1) improved scientific writing and enhancing research equity and versatility; (2) utility in health care research (efficient analysis of datasets, code generation, literature reviews, saving time to focus on experimental design, and drug discovery and development); (3) benefits in health care practice (streamlining the workflow, cost saving, documentation, personalized medicine, and improved health literacy); and (4) benefits in health care education including improved personalized learning and the focus on critical thinking and problem-based learning.
Concerns regarding ChatGPT use were stated in 58/60 (96.7%) records including ethical, copyright, transparency, and legal issues, the risk of bias, plagiarism, lack of originality, inaccurate content with risk of hallucination, limited knowledge, incorrect citations, cybersecurity issues, and risk of infodemics. The promising applications of ChatGPT can induce paradigm shifts in health care education, research, and practice. However, the embrace of this AI chatbot should be conducted with extreme caution considering its potential limitations. As it currently stands, ChatGPT does not qualify to be listed as an author in scientific articles unless the ICMJE/COPE guidelines are revised or amended. An initiative involving all stakeholders in health care education, research, and practice is urgently needed. This will help to set a code of ethics to guide the responsible use of ChatGPT among other LLMs in health care and academia.",Sallam M 37705958,A descriptive study based on the comparison of ChatGPT and evidence-based neurosurgeons.,2023,iScience,,,,"ChatGPT is an artificial intelligence product developed by OpenAI. This study aims to investigate whether ChatGPT can respond in accordance with evidence-based medicine in neurosurgery. We generated 50 neurosurgical questions covering neurosurgical diseases. Each question was posed three times to GPT-3.5 and GPT-4.0. We also recruited three neurosurgeons with high, middle, and low seniority to respond to questions. The results were analyzed regarding ChatGPT's overall performance score, mean scores by the items' specialty classification, and question type. In conclusion, GPT-3.5's ability to respond in accordance with evidence-based medicine was comparable to that of neurosurgeons with low seniority, and GPT-4.0's ability was comparable to that of neurosurgeons with high seniority. 
Although ChatGPT is yet to be comparable to a neurosurgeon with high seniority, future upgrades could enhance its performance and abilities.",Liu J; Zheng J; Cai X; Wu D; Yin C 37379067,Utility of ChatGPT in Clinical Practice.,2023,Journal of medical Internet research,,,,"ChatGPT is receiving increasing attention and has a variety of application scenarios in clinical practice. In clinical decision support, ChatGPT has been used to generate accurate differential diagnosis lists, support clinical decision-making, optimize clinical decision support, and provide insights for cancer screening decisions. In addition, ChatGPT has been used for intelligent question-answering to provide reliable information about diseases and medical queries. In terms of medical documentation, ChatGPT has proven effective in generating patient clinical letters, radiology reports, medical notes, and discharge summaries, improving efficiency and accuracy for health care providers. Future research directions include real-time monitoring and predictive analytics, precision medicine and personalized treatment, the role of ChatGPT in telemedicine and remote health care, and integration with existing health care systems. Overall, ChatGPT is a valuable tool that complements the expertise of health care providers and improves clinical decision-making and patient care. However, ChatGPT is a double-edged sword. We need to carefully consider and study the benefits and potential dangers of ChatGPT. In this viewpoint, we discuss recent advances in ChatGPT research in clinical practice and suggest possible risks and challenges of using ChatGPT in clinical practice. 
This will help guide and support future research on artificial intelligence similar to ChatGPT in health care.",Liu J; Wang C; Liu S 39949509,"Benefits, limits, and risks of ChatGPT in medicine.",2025,Frontiers in artificial intelligence,,,,"ChatGPT represents a transformative technology in healthcare, with demonstrated impacts across clinical practice, medical education, and research. Studies show significant efficiency gains, including 70% reduction in administrative time for discharge summaries and achievement of medical professional-level performance on standardized tests (60% accuracy on USMLE, 78.2% on PubMedQA). ChatGPT offers personalized learning platforms, automated scoring, and instant access to vast medical knowledge in medical education, addressing resource limitations and enhancing training efficiency. It streamlines clinical workflows by supporting triage processes, generating discharge summaries, and alleviating administrative burdens, allowing healthcare professionals to focus more on patient care. Additionally, ChatGPT facilitates remote monitoring and chronic disease management, providing personalized advice, medication reminders, and emotional support, thus bridging gaps between clinical visits. Its ability to process and synthesize vast amounts of data accelerates research workflows, aiding in literature reviews, hypothesis generation, and clinical trial designs. This paper aims to gather and analyze published studies involving ChatGPT, focusing on exploring its advantages and disadvantages within the healthcare context. To aid in understanding and progress, our analysis is organized into six key areas: (1) Information and Education, (2) Triage and Symptom Assessment, (3) Remote Monitoring and Support, (4) Mental Healthcare Assistance, (5) Research and Decision Support, and (6) Language Translation.
Realizing ChatGPT's full potential in healthcare requires addressing key limitations, such as its lack of clinical experience, inability to process visual data, and absence of emotional intelligence. Ethical, privacy, and regulatory challenges further complicate its integration. Future improvements should focus on enhancing accuracy, developing multimodal AI models, improving empathy through sentiment analysis, and safeguarding against artificial hallucination. While not a replacement for healthcare professionals, ChatGPT can serve as a powerful assistant, augmenting their expertise to improve efficiency, accessibility, and quality of care. This collaboration ensures responsible adoption of AI in transforming healthcare delivery. While ChatGPT demonstrates significant potential in healthcare transformation, systematic evaluation of its implementation across different healthcare settings reveals varying levels of evidence quality, from robust randomized trials in medical education to preliminary observational studies in clinical practice. This heterogeneity in evidence quality necessitates a structured approach to future research and implementation.",Tangsrivimol JA; Darzidehkalani E; Virk HUH; Wang Z; Egger J; Wang M; Hacking S; Glicksberg BS; Strauss M; Krittanawong C 37523010,[Big hype about ChatGPT in medicine : Is it something for rhythmologists? What must be taken into consideration?].,2023,Herzschrittmachertherapie & Elektrophysiologie,,,,"ChatGPT, a chatbot based on a large language model, is currently attracting much attention. Modern machine learning (ML) architectures enable the program to answer almost any question, to summarize, translate, and even generate its own texts, all in a text-based dialogue with the user. Underlying technologies, summarized under the acronym NLP (natural language processing), go back to the 1960s. In almost all areas including medicine, ChatGPT is raising enormous hopes.
It can easily pass medical exams and may be useful in patient care, diagnostic and therapeutic assistance, and medical research. The enthusiasm for this new technology shown even by medical professionals is surprising. Although the system knows much, it does not know everything; not everything it outputs is accurate either. Every output has to be carefully checked by the user for correctness, which is often not easily done since references to sources are lacking. Issues regarding data protection and ethics also arise. Today's language models are not free of bias and systematic distortion. These shortcomings have led to calls for stronger regulation of the use of ChatGPT and an increasing number of similar language models. However, this new technology represents an enormous progress in knowledge processing and dissemination. Numerous scenarios in which ChatGPT can provide assistance are conceivable, including in rhythmology. In the future, it will be crucial to render the models error-free and transparent and to clearly define the rules for their use. Responsible use requires systematic training to improve the digital competence of users, including physicians who use such programs.",Haverkamp W; Strodthoff N; Tennenbaum J; Israel C 38261307,"Large language model, AI and scientific research: why ChatGPT is only the beginning.",2024,Journal of neurosurgical sciences,,,,"ChatGPT, a conversational artificial intelligence model based on the generative pre-trained transformer GPT architecture, has garnered widespread attention due to its user-friendly nature and diverse capabilities. This technology enables users of all backgrounds to effortlessly engage in human-like conversations and receive coherent and intelligible responses. Beyond casual interactions, ChatGPT offers compelling prospects for scientific research, facilitating tasks like literature review and content summarization, ultimately expediting and enhancing the academic writing process. 
Still, in the field of medicine and surgery, it has already shown its endless potential in many tasks (enhancing decision-making processes, aiding in surgical planning and simulation, providing real-time assistance during surgery, improving postoperative care and rehabilitation, contributing to training, education, research, and development). However, it is crucial to acknowledge the model's limitations, encompassing knowledge constraints and the potential for erroneous responses, as well as ethical and legal considerations. This paper explores the potential benefits and pitfalls of these innovative technologies in scientific research, shedding light on their transformative impact while addressing concerns surrounding their use.",Zangrossi P; Martini M; Guerrini F; DE Bonis P; Spena G 39383119,ChatGPT M.D.: Is there any room for generative AI in neurology?,2024,PloS one,,,,"ChatGPT, a general artificial intelligence, has been recognized as a powerful tool in scientific writing and programming but its use as a medical tool is largely overlooked. The general accessibility, rapid response time and comprehensive training database might enable ChatGPT to serve as a diagnostic augmentation tool in certain clinical settings. The diagnostic process in neurology is often challenging and complex. In certain time-sensitive scenarios, rapid evaluation and diagnostic decisions are needed, while in other cases clinicians are faced with rare disorders and atypical disease manifestations. Due to these factors, the diagnostic accuracy in neurology is often suboptimal. Here we evaluated whether ChatGPT can be utilized as a valuable and innovative diagnostic augmentation tool in various neurological settings. We used synthetic data generated by neurological experts to represent descriptive anamneses of patients with known neurology-related diseases, then the probability for an appropriate diagnosis made by ChatGPT was measured. 
To give clarity to the accuracy of the AI-determined diagnosis, all cases have been cross-validated by other experts and general medical doctors as well. We found that ChatGPT-determined diagnostic accuracy (ranging from 68.5% +/- 3.28% to 83.83% +/- 2.73%) can reach the accuracy of other experts (81.66% +/- 2.02%), furthermore, it surpasses the probability of an appropriate diagnosis if the examiner is a general medical doctor (57.15% +/- 2.64%). Our results showcase the efficacy of general artificial intelligence like ChatGPT as a diagnostic augmentation tool in medicine. In the future, AI-based supporting tools might be useful amendments in medical practice and help to improve the diagnostic process in neurology.",Nogradi B; Polgar TF; Meszlenyi V; Kadar Z; Hertelendy P; Csati A; Szpisjak L; Halmi D; Erdelyi-Furka B; Toth M; Molnar F; Toth D; Bosze Z; Boda K; Klivenyi P; Siklos L; Patai R 39196640,"Current Status of ChatGPT Use in Medical Education: Potentials, Challenges, and Strategies.",2024,Journal of medical Internet research,,,,"ChatGPT, a generative pretrained transformer, has garnered global attention and sparked discussions since its introduction on November 30, 2022. However, it has generated controversy within the realms of medical education and scientific research. This paper examines the potential applications, limitations, and strategies for using ChatGPT. ChatGPT offers personalized learning support to medical students through its robust natural language generation capabilities, enabling it to furnish answers. Moreover, it has demonstrated significant use in simulating clinical scenarios, facilitating teaching and learning processes, and revitalizing medical education. Nonetheless, numerous challenges accompany these advancements. In the context of education, it is of paramount importance to prevent excessive reliance on ChatGPT and combat academic plagiarism. 
Likewise, in the field of medicine, it is vital to guarantee the timeliness, accuracy, and reliability of content generated by ChatGPT. Concurrently, ethical challenges and concerns regarding information security arise. In light of these challenges, this paper proposes targeted strategies for addressing them. First, the risk of overreliance on ChatGPT and academic plagiarism must be mitigated through ideological education, fostering comprehensive competencies, and implementing diverse evaluation criteria. The integration of contemporary pedagogical methodologies in conjunction with the use of ChatGPT serves to enhance the overall quality of medical education. To enhance the professionalism and reliability of the generated content, it is recommended to implement measures to optimize ChatGPT's training data professionally and enhance the transparency of the generation process. This ensures that the generated content is aligned with the most recent standards of medical practice. Moreover, the enhancement of value alignment and the establishment of pertinent legislation or codes of practice address ethical concerns, including those pertaining to algorithmic discrimination, the allocation of medical responsibility, privacy, and security. In conclusion, while ChatGPT presents significant potential in medical education, it also encounters various challenges. Through comprehensive research and the implementation of suitable strategies, it is anticipated that ChatGPT's positive impact on medical education will be harnessed, laying the groundwork for advancing the discipline and fostering the development of high-caliber medical professionals.",Xu T; Weng H; Liu F; Yang L; Luo Y; Ding Z; Wang Q 37485160,Eyes on AI: ChatGPT's Transformative Potential Impact on Ophthalmology.,2023,Cureus,,,,"ChatGPT, a large language model by OpenAI, has been adopted in various domains since its release in November 2022, but its application in ophthalmology remains less explored. 
This editorial assesses ChatGPT's potential applications and limitations in ophthalmology across clinical, educational, and research settings. In clinical settings, ChatGPT can serve as an assistant, offering diagnostic and therapeutic suggestions based on patient data and assisting in patient triage. However, its tendencies to generate inaccurate results and its inability to keep up with recent medical guidelines render it unsuitable for standalone clinical decision-making. Data security and compliance with the Health Insurance Portability and Accountability Act (HIPAA) also pose concerns, given ChatGPT's potential to inadvertently expose sensitive patient information. In education, ChatGPT can generate practice questions, provide explanations, and create patient education materials. However, its performance in answering domain-specific questions is suboptimal. In research, ChatGPT can facilitate literature reviews, data analysis, manuscript development, and peer review, but issues of accuracy, bias, and ethics need careful consideration. Ultimately, ensuring accuracy, ethical integrity, and data privacy is essential when integrating ChatGPT into ophthalmology.",Dossantos J; An J; Javan R 40375935,Quantum leap in medical mentorship: exploring ChatGPT's transition from textbooks to terabytes.,2025,Frontiers in medicine,,,,"ChatGPT, an advanced AI language model, presents a transformative opportunity in several fields including the medical education. This article examines the integration of ChatGPT into healthcare learning environments, exploring its potential to revolutionize knowledge acquisition, personalize education, support curriculum development, and enhance clinical reasoning. The AI's ability to swiftly access and synthesize medical information across various specialties offers significant value to students and professionals alike. 
It provides rapid answers to queries on medical theories, treatment guidelines, and diagnostic methods, potentially accelerating the learning curve. The paper emphasizes the necessity of verifying ChatGPT's outputs against authoritative medical sources. A key advantage highlighted is the AI's capacity to tailor learning experiences by assessing individual needs, accommodating diverse learning styles, and offering personalized feedback. The article also considers ChatGPT's role in shaping curricula and assessment techniques, suggesting that educators may need to adapt their methods to incorporate AI-driven learning tools. Additionally, it explores how ChatGPT could bolster clinical problem-solving through AI-powered simulations, fostering critical thinking and diagnostic acumen among students. While recognizing ChatGPT's transformative potential in medical education, the article stresses the importance of thoughtful implementation, continuous validation, and the establishment of protocols to ensure its responsible and effective application in healthcare education settings.",Chokkakula S; Chong S; Yang B; Jiang H; Yu J; Han R; Attitalla IH; Yin C; Zhang S 38476626,The Potential Applications and Challenges of ChatGPT in the Medical Field.,2024,International journal of general medicine,,,,"ChatGPT, an AI-driven conversational large language model (LLM), has garnered significant scholarly attention since its inception, owing to its manifold applications in the realm of medical science. This study primarily examines the merits, limitations, anticipated developments, and practical applications of ChatGPT in clinical practice, healthcare, medical education, and medical research. It underscores the necessity for further research and development to enhance its performance and deployment. 
Moreover, future research avenues encompass ongoing enhancements and standardization of ChatGPT, mitigating its limitations, and exploring its integration and applicability in translational and personalized medicine. Reflecting the narrative nature of this review, a focused literature search was performed to identify relevant publications on ChatGPT's use in medicine. This process was aimed at gathering a broad spectrum of insights to provide a comprehensive overview of the current state and future prospects of ChatGPT in the medical domain. The objective is to aid healthcare professionals in understanding the groundbreaking advancements associated with the latest artificial intelligence tools, while also acknowledging the opportunities and challenges presented by ChatGPT.",Mu Y; He D 38172581,ChatGPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology.,2024,"Eye (London, England)",,,,"ChatGPT, an artificial intelligence (AI) chatbot built on large language models (LLMs), has rapidly gained popularity. The benefits and limitations of this transformative technology have been discussed across various fields, including medicine. The widespread availability of ChatGPT has enabled clinicians to study how these tools could be used for a variety of tasks such as generating differential diagnosis lists, organizing patient notes, and synthesizing literature for scientific research. LLMs have shown promising capabilities in ophthalmology by performing well on the Ophthalmic Knowledge Assessment Program, providing fairly accurate responses to questions about retinal diseases, and in generating differential diagnoses list. 
There are current limitations to this technology, including the propensity of LLMs to ""hallucinate"", or confidently generate false information; their potential role in perpetuating biases in medicine; and the challenges in incorporating LLMs into research without allowing ""AI-plagiarism"" or publication of false information. In this paper, we provide a balanced overview of what LLMs are and introduce some of the LLMs that have been generated in the past few years. We discuss recent literature evaluating the role of these language models in medicine with a focus on ChatGPT. The field of AI is fast-paced, and new applications based on LLMs are being generated rapidly; therefore, it is important for ophthalmologists to be aware of how this technology works and how it may impact patient care. Here, we discuss the benefits, limitations, and future advancements of LLMs in patient care and research.",Kedia N; Sanjeev S; Ong J; Chhablani J 39196686,"""Where No One Has Gone Before"": Questions to Ensure the Ethical, Rigorous, and Thoughtful Application of Artificial Intelligence in the Analysis of HIV Research.",2024,The Journal of the Association of Nurses in AIDS Care : JANAC,,,,"ChatGPT, an artificial intelligence (AI) system released by OpenAI on November 30th, 2022, has upended scientific and educational paradigms, reshaping the way that we think about teaching, writing, and now research. Since that time, qualitative data analytic software programs such as ATLAS.ti have quickly incorporated AI into their programs to assist with or even replace human coding. Qualitative research is key to understanding the complexity and nuance of HIV-related behaviors, through descriptive and historical textual research, as well as the lived experiences of people with HIV. This commentary weighs the pros and cons of the use of AI coding in HIV-related qualitative research. 
We pose guiding questions that may help researchers evaluate the application and scope of AI in qualitative research as determined by the research question, underlying epistemology, and goal(s). Qualitative data encompasses a variety of media, methodologies, and styles that exist on a spectrum underpinned by epistemology. The research question and the data sources are informed by the researcher's epistemological viewpoint. Given the heterogeneous applications of qualitative research in nursing, medicine, and public health, there are circumstances where qualitative AI coding is appropriate, but this should be congruent with the aims and underlying epistemology of the research.",Bergman AJ; McNabb KC; Relf MV; Dredze MH 37038381,Overview of Early ChatGPT's Presence in Medical Literature: Insights From a Hybrid Literature Review by ChatGPT and Human Experts.,2023,Cureus,,,,"ChatGPT, an artificial intelligence chatbot, has rapidly gained prominence in various domains, including medical education and healthcare literature. This hybrid narrative review, conducted collaboratively by human authors and ChatGPT, aims to summarize and synthesize the current knowledge of ChatGPT in the indexed medical literature during its initial four months. A search strategy was employed in PubMed and EuropePMC databases, yielding 65 and 110 papers, respectively. These papers focused on ChatGPT's impact on medical education, scientific research, medical writing, ethical considerations, diagnostic decision-making, automation potential, and criticisms. 
The findings indicate a growing body of literature on ChatGPT's applications and implications in healthcare, highlighting the need for further research to assess its effectiveness and ethical concerns.",Temsah O; Khan SA; Chaiah Y; Senjab A; Alhasan K; Jamal A; Aljamaan F; Malki KH; Halwani R; Al-Tawfiq JA; Temsah MH; Al-Eyadhy A 37399030,"ChatGPT, GPT-4, and Other Large Language Models: The Next Revolution for Clinical Microbiology?",2023,Clinical infectious diseases : an official publication of the Infectious Diseases Society of America,,,,"ChatGPT, GPT-4, and Bard are highly advanced natural language process-based computer programs (chatbots) that simulate and process human conversation in written or spoken form. Recently released by the company OpenAI, ChatGPT was trained on billions of unknown text elements (tokens) and rapidly gained wide attention for its ability to respond to questions in an articulate manner across a wide range of knowledge domains. These potentially disruptive large language model (LLM) technologies have a broad range of conceivable applications in medicine and medical microbiology. In this opinion article, I describe how chatbot technologies work and discuss the strengths and weaknesses of ChatGPT, GPT-4, and other LLMs for applications in the routine diagnostic laboratory, focusing on various use cases for the pre- to post-analytical process.",Egli A 38911678,"ChatGPT in veterinary medicine: a practical guidance of generative artificial intelligence in clinics, education, and research.",2024,Frontiers in veterinary science,,,,"ChatGPT, the most accessible generative artificial intelligence (AI) tool, offers considerable potential for veterinary medicine, yet a dedicated review of its specific applications is lacking. This review concisely synthesizes the latest research and practical applications of ChatGPT within the clinical, educational, and research domains of veterinary medicine. 
It intends to provide specific guidance and actionable examples of how generative AI can be directly utilized by veterinary professionals without a programming background. For practitioners, ChatGPT can extract patient data, generate progress notes, and potentially assist in diagnosing complex cases. Veterinary educators can create custom GPTs for student support, while students can utilize ChatGPT for exam preparation. ChatGPT can aid in academic writing tasks in research, but veterinary publishers have set specific requirements for authors to follow. Despite its transformative potential, careful use is essential to avoid pitfalls like hallucination. This review addresses ethical considerations, provides learning resources, and offers tangible examples to guide responsible implementation. A table of key takeaways was provided to summarize this review. By highlighting potential benefits and limitations, this review equips veterinarians, educators, and researchers to harness the power of ChatGPT effectively.",Chu CP 37085182,"Early applications of ChatGPT in medical practice, education and research.",2023,"Clinical medicine (London, England)",,,,"ChatGPT, which can automatically generate written responses to queries using internet sources, soon went viral after its release at the end of 2022. The performance of ChatGPT on medical exams shows results near the passing threshold, making it comparable to third-year medical students. It can also write academic abstracts or reviews at an acceptable level. However, it is not clear how ChatGPT deals with harmful content, misinformation or plagiarism; therefore, authors using ChatGPT professionally for academic writing should be cautious. ChatGPT also has the potential to facilitate the interaction between healthcare providers and patients in various ways. However, sophisticated tasks such as understanding the human anatomy are still a limitation of ChatGPT. 
ChatGPT can simplify radiological reports, but the possibility of incorrect statements and missing medical information remains. Although ChatGPT has the potential to change medical practice, education and research, further improvements of this application are needed for regular use in medicine.",Sedaghat S 38310152,Evaluating AI in medicine: a comparative analysis of expert and ChatGPT responses to colorectal cancer questions.,2024,Scientific reports,,,,"Colorectal cancer (CRC) is a global health challenge, and patient education plays a crucial role in its early detection and treatment. Despite progress in AI technology, as exemplified by transformer-like models such as ChatGPT, there remains a lack of in-depth understanding of their efficacy for medical purposes. We aimed to assess the proficiency of ChatGPT in the field of popular science, specifically in answering questions related to CRC diagnosis and treatment, using the book ""Colorectal Cancer: Your Questions Answered"" as a reference. In general, 131 valid questions from the book were manually input into ChatGPT. Responses were evaluated by clinical physicians in the relevant fields based on comprehensiveness and accuracy of information, and scores were standardized for comparison. Not surprisingly, ChatGPT showed high reproducibility in its responses, with high uniformity in comprehensiveness, accuracy, and final scores. However, the mean scores of ChatGPT's responses were significantly lower than the benchmarks, indicating it has not reached an expert level of competence in CRC. While it could provide accurate information, it lacked in comprehensiveness. Notably, ChatGPT performed well in domains of radiation therapy, interventional therapy, stoma care, venous care, and pain control, almost rivaling the benchmarks, but fell short in basic information, surgery, and internal medicine domains. 
While ChatGPT demonstrated promise in specific domains, its general efficiency in providing CRC information falls short of expert standards, indicating the need for further advancements and improvements in AI technology for patient education in healthcare.",Peng W; Feng Y; Yao C; Zhang S; Zhuo H; Qiu T; Zhang Y; Tang J; Gu Y; Sun Y 38244054,Assessment of Pathology Domain-Specific Knowledge of ChatGPT and Comparison to Human Performance.,2024,Archives of pathology & laboratory medicine,,,,"CONTEXT.-: Artificial intelligence algorithms hold the potential to fundamentally change many aspects of society. Application of these tools, including the publicly available ChatGPT, has demonstrated impressive domain-specific knowledge in many areas, including medicine. OBJECTIVES.-: To understand the level of pathology domain-specific knowledge for ChatGPT using different underlying large language models, GPT-3.5 and the updated GPT-4. DESIGN.-: An international group of pathologists (n = 15) was recruited to generate pathology-specific questions at a similar level to those that could be seen on licensing (board) examinations. The questions (n = 15) were answered by GPT-3.5, GPT-4, and a staff pathologist who recently passed their Canadian pathology licensing exams. Participants were instructed to score answers on a 5-point scale and to predict which answer was written by ChatGPT. RESULTS.-: GPT-3.5 performed at a similar level to the staff pathologist, while GPT-4 outperformed both. The overall score for both GPT-3.5 and GPT-4 was within the range of meeting expectations for a trainee writing licensing examinations. In all but one question, the reviewers were able to correctly identify the answers generated by GPT-3.5. 
CONCLUSIONS.-: By demonstrating the ability of ChatGPT to answer pathology-specific questions at a level similar to (GPT-3.5) or exceeding (GPT-4) a trained pathologist, this study highlights the potential of large language models to be transformative in this space. In the future, more advanced iterations of these algorithms with increased domain-specific knowledge may have the potential to assist pathologists and enhance pathology resident training.",Wang AY; Lin S; Tran C; Homer RJ; Wilsdon D; Walsh JC; Goebel EA; Sansano I; Sonawane S; Cockenpot V; Mukhopadhyay S; Taskin T; Zahra N; Cima L; Semerci O; Ozamrak BG; Mishra P; Vennavalli NS; Chen PC; Cecchini MJ 39035703,The role of artificial intelligence in cosmetic and functional gynecology: Stepping into the third millennium.,2024,European journal of obstetrics & gynecology and reproductive biology: X,,,,"Cosmetic and functional gynecology have gained popularity among patients, but the scientific literature in this field, particularly regarding the cosmetic aspect, is lacking. The use of evidence-based medicine is crucial to validate diagnostic tools and treatment protocols. However, the advent of artificial intelligence (AI) offers a promising solution to address this issue. ChatGPT, a sophisticated language model, can revolutionize AI in medicine, enabling accurate diagnosis, personalized treatment plans, and expedited research analysis. Cosmetic and functional gynecology can leverage AI to develop the field and improve evidence gathering. AI can aid in precise and personalized diagnosis, implement standardized assessment tools, simulate treatment outcomes, and assess under-skin anatomy through virtual reality. AI tools can assist clinicians in diagnosing and comparing difficult cases, calculate treatment risks, and contribute to standardization by collecting global evidence and generating guidelines. 
The use of AI in cosmetic and functional gynecology holds significant potential to advance the field and improve patient outcomes. This novel combination of AI and gynecology represents a groundbreaking development in medicine, emphasizing the importance of appropriate and correct AI usage.",Buzzaccarini G; Degliuomini RS; Etrusco A; Giannini A; D'Amato A; Gkouvi K; Berreni N; Magon N; Candiani M; Salvatore S 38873307,Assessing the utility of artificial intelligence throughout the triage outpatients: a prospective randomized controlled clinical study.,2024,Frontiers in public health,,,,"Currently, there are still many patients who require outpatient triage assistance. ChatGPT, a natural language processing tool powered by artificial intelligence technology, is increasingly utilized in medicine. To facilitate and expedite patients' navigation to the appropriate department, we conducted an outpatient triage evaluation of ChatGPT. For this evaluation, we posed 30 highly representative and common outpatient questions to ChatGPT and scored its responses using a panel of five experienced doctors. The consistency of manual triage and ChatGPT triage was assessed by five experienced doctors, and statistical analysis was performed using the Chi-square test. The expert ratings of ChatGPT's answers to these 30 frequently asked questions revealed 17 responses earning very high scores (10 and 9.5 points), 7 earning high scores (9 points), and 6 receiving low scores (8 and 7 points). Additionally, we conducted a prospective cohort study in which 45 patients completed forms detailing gender, age, and symptoms. Triage was then performed by outpatient triage staff and ChatGPT. Among the 45 patients, we found a high level of agreement between manual triage and ChatGPT triage (consistency: 93.3-100%, p<0.0001). We were pleasantly surprised to observe that ChatGPT's responses were highly professional, comprehensive, and humanized. 
This innovation can help patients win more treatment time, improve patient diagnosis and cure rates, and alleviate the pressure of medical staff shortage.",Liu X; Lai R; Wu C; Yan C; Gan Z; Yang Y; Zeng X; Liu J; Liao L; Lin Y; Jing H; Zhang W 40267969,Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning.,2025,Nature medicine,,,,"DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. Here we assessed the capabilities of three LLMs- DeepSeek-R1, ChatGPT-o1 and Llama 3.1-405B-in performing four different medical tasks: answering questions from the United States Medical Licensing Examination (USMLE), interpreting and reasoning on the basis of text-based diagnostic and management cases, providing tumor classification according to RECIST 1.1 criteria and providing summaries of diagnostic imaging reports across multiple modalities. In the USMLE test, the performance of DeepSeek-R1 (accuracy 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy 0.95; P = 0.04) but better than that of Llama 3.1-405B (accuracy 0.83; P < 10(-3)). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy of 0.57 versus 0.55; P = 0.76 and 0.74 versus 0.76; P = 0.06, using New England Journal of Medicine and Medicilline databases, respectively). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.74 versus 0.81; P = 0.10). Diagnostic reasoning steps provided by DeepSeek were deemed more accurate than those provided by ChatGPT and Llama 3.1-405B (average Likert score of 3.61, 3.22 and 3.13, respectively, P = 0.005 and P < 10(-3)). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 versus 4.8; P < 10(-3)). 
This study highlights the potential of DeepSeek-R1 LLM for medical applications but also underlines areas needing improvements.",Tordjman M; Liu Z; Yuce M; Fauveau V; Mei Y; Hadjadj J; Bolger I; Almansour H; Horst C; Parihar AS; Geahchan A; Meribout A; Yatim N; Ng N; Robson P; Zhou A; Lewis S; Huang M; Deyer T; Taouli B; Lee HC; Fayad ZA; Mei X 38477276,AI Chatbots and Challenges of HIPAA Compliance for AI Developers and Vendors.,2023,"The Journal of law, medicine & ethics : a journal of the American Society of Law, Medicine & Ethics",,,,"Developers and vendors of large language models (""LLMs"") - such as ChatGPT, Google Bard, and Microsoft's Bing at the forefront-can be subject to Health Insurance Portability and Accountability Act of 1996 (""HIPAA"") when they process protected health information (""PHI"") on behalf of the HIPAA covered entities. In doing so, they become business associates or subcontractors of a business associate under HIPAA.",Rezaeikhonakdar D 38826194,Exploiting ChatGPT for Diagnosing Autism-Associated Language Disorders and Identifying Distinct Features.,2024,Research square,,,,"Diagnosing language disorders associated with autism is a complex and nuanced challenge, often hindered by the subjective nature and variability of traditional assessment methods. Traditional diagnostic methods not only require intensive human effort but also often result in delayed interventions due to their lack of speed and specificity. In this study, we explored the application of ChatGPT, a state-of-the-art large language model, to overcome these obstacles by enhancing diagnostic accuracy and profiling specific linguistic features indicative of autism. Leveraging ChatGPT's advanced natural language processing capabilities, this research aims to streamline and refine the diagnostic process. 
Specifically, we compared ChatGPT's performance with that of conventional supervised learning models, including BERT, a model acclaimed for its effectiveness in various natural language processing tasks. We showed that ChatGPT substantially outperformed these models, achieving over 13% improvement in both accuracy and F1-score in a zero-shot learning configuration. This marked enhancement highlights the model's potential as a superior tool for neurological diagnostics. Additionally, we identified ten distinct features of autism-associated language disorders that vary significantly across different experimental scenarios. These features, which included echolalia, pronoun reversal, and atypical language usage, were crucial for accurately diagnosing ASD and customizing treatment plans. Together, our findings advocate for adopting sophisticated AI tools like ChatGPT in clinical settings to assess and diagnose developmental disorders. Our approach not only promises greater diagnostic precision but also aligns with the goals of personalized medicine, potentially transforming the evaluation landscape for autism and similar neurological conditions.",Hu C; Li W; Ruan M; Yu X; Paul LK; Wang S; Li X 38056135,ChatGPT as an aid for pathological diagnosis of cancer.,2024,"Pathology, research and practice",,,,"Diagnostic workup of cancer patients is highly reliant on the science of pathology using cytopathology, histopathology, and other ancillary techniques like immunohistochemistry and molecular cytogenetics. Data processing and learning by means of artificial intelligence (AI) has become a spearhead for the advancement of medicine, with pathology and laboratory medicine being no exceptions. ChatGPT, an artificial intelligence (AI)-based chatbot, that was recently launched by OpenAI, is currently a talk of the town, and its role in cancer diagnosis is also being explored meticulously. 
The pathology workflow, through the integration of digital slides, the implementation of advanced algorithms, and computer-aided diagnostic techniques, extends the frontiers of the pathologist's view beyond the microscopic slide and enables effective integration, assimilation, and utilization of knowledge that is beyond human limits and boundaries. Despite its numerous advantages in the pathological diagnosis of cancer, it comes with several challenges, such as the integration of digital slides with input language parameters, problems of bias, and legal issues, which have to be addressed soon so that we as pathologists diagnosing malignancies are on the same bandwagon and don't miss the train.",Malik S; Zaheer S 40119714,Analysis of responses from artificial intelligence programs to medication-related questions derived from critical care guidelines.,2025,American journal of health-system pharmacy : AJHP : official journal of the American Society of Health-System Pharmacists,,,,"DISCLAIMER: In an effort to expedite the publication of articles, AJHP is posting manuscripts online as soon as possible after acceptance. Accepted manuscripts have been peer-reviewed and copyedited, but are posted online before technical formatting and author proofing. These manuscripts are not the final version of record and will be replaced with the final article (formatted per AJHP style and proofed by the authors) at a later time. PURPOSE: To evaluate the recommendations given by 4 publicly available artificial intelligence (AI) programs in comparison to recommendations in current clinical practice guidelines (CPGs) focused on critically ill adults. METHODS: This study evaluated 4 publicly available large language models (LLMs): ChatGPT 4.0, Microsoft Copilot, Google Gemini Version 1.5, and Meta AI.
Each AI chatbot was prompted with medication-related questions related to 6 CPGs published by the Society of Critical Care Medicine (SCCM) and also asked to provide references to support its recommendations. Responses were categorized as correct, partially correct, not correct, or ""other"" (eg, the LLM answered a question not asked). RESULTS: In total, 43 responses were recorded for each AI program, with a significant difference (P = 0.007) in response types by AI program. Microsoft Copilot had the highest proportion of correct recommendations, followed by Meta AI, ChatGPT 4.0, and Google Gemini. All 4 LLMs gave some incorrect recommendations, with Gemini having the most incorrect responses, followed closely by ChatGPT. Copilot had the most responses in the ""other"" category (n = 5, 11.63%). On average, ChatGPT provided the greatest number of references per question (n = 4.54), followed by Google Gemini (n = 3.43), Meta AI (n = 3.06), and Microsoft Copilot (n = 2.04). CONCLUSION: Although they showed potential for future utility to pharmacists with further development and refinement, the evaluated AI programs did not consistently give accurate medication-related recommendations for the purpose of answering clinical questions such as those pertaining to critical care CPGs.",Williams B; Erstad BL 38578309,Diagnostic power of ChatGPT 4 in distal radius fracture detection through wrist radiographs.,2024,Archives of orthopaedic and trauma surgery,,,,"Distal radius fractures rank among the most prevalent fractures in humans, necessitating accurate radiological imaging and interpretation for optimal diagnosis and treatment. In addition to human radiologists, artificial intelligence systems are increasingly employed for radiological assessments. Since 2023, ChatGPT 4 has offered image analysis capabilities, which can also be used for the analysis of wrist radiographs. 
This study evaluates the diagnostic power of ChatGPT 4 in identifying distal radius fractures, comparing it with a board-certified radiologist, a hand surgery resident, a medical student, and the well-established AI Gleamer BoneView. Results demonstrate ChatGPT 4's good diagnostic accuracy (sensitivity 0.88, specificity 0.98, diagnostic power (AUC) 0.93), significantly surpassing the medical student (sensitivity 0.98, specificity 0.72, diagnostic power (AUC) 0.85; p = 0.04). Nevertheless, the diagnostic power of ChatGPT 4 lags behind the hand surgery resident (sensitivity 0.99, specificity 0.98, diagnostic power (AUC) 0.985; p = 0.014) and Gleamer BoneView (sensitivity 1.00, specificity 0.98, diagnostic power (AUC) 0.99; p = 0.006). This study highlights the utility and potential applications of artificial intelligence in modern medicine, emphasizing ChatGPT 4 as a valuable tool for enhancing diagnostic capabilities in the field of medical imaging.",Mert S; Stoerzer P; Brauer J; Fuchs B; Haas-Lutzenberger EM; Demmer W; Giunta RE; Nuernberger T 38409350,Leveraging generative AI to prioritize drug repurposing candidates for Alzheimer's disease with real-world clinical validation.,2024,NPJ digital medicine,,,,"Drug repurposing represents an attractive alternative to the costly and time-consuming process of new drug development, particularly for serious, widespread conditions with limited effective treatments, such as Alzheimer's disease (AD). Emerging generative artificial intelligence (GAI) technologies like ChatGPT offer the promise of expediting the review and summary of scientific knowledge.
To examine the feasibility of using GAI for identifying drug repurposing candidates, we iteratively tasked ChatGPT with proposing the twenty most promising drugs for repurposing in AD, and tested the top ten for risk of incident AD in exposed and unexposed individuals over age 65 in two large clinical datasets: (1) Vanderbilt University Medical Center and (2) the All of Us Research Program. Among the candidates suggested by ChatGPT, metformin, simvastatin, and losartan were associated with lower AD risk in meta-analysis. These findings suggest GAI technologies can assimilate scientific insights from an extensive Internet-based search space, helping to prioritize drug repurposing candidates and facilitate the treatment of diseases.",Yan C; Grabowska ME; Dickson AL; Li B; Wen Z; Roden DM; Michael Stein C; Embi PJ; Peterson JF; Feng Q; Malin BA; Wei WQ 37503019,Leveraging Generative AI to Prioritize Drug Repurposing Candidates: Validating Identified Candidates for Alzheimer's Disease in Real-World Clinical Datasets.,2023,Research square,,,,"Drug repurposing represents an attractive alternative to the costly and time-consuming process of new drug development, particularly for serious, widespread conditions with limited effective treatments, such as Alzheimer's disease (AD). Emerging generative artificial intelligence (GAI) technologies like ChatGPT offer the promise of expediting the review and summary of scientific knowledge. To examine the feasibility of using GAI for identifying drug repurposing candidates, we iteratively tasked ChatGPT with proposing the twenty most promising drugs for repurposing in AD, and tested the top ten for risk of incident AD in exposed and unexposed individuals over age 65 in two large clinical datasets: 1) Vanderbilt University Medical Center and 2) the All of Us Research Program. Among the candidates suggested by ChatGPT, metformin, simvastatin, and losartan were associated with lower AD risk in meta-analysis. 
These findings suggest GAI technologies can assimilate scientific insights from an extensive Internet-based search space, helping to prioritize drug repurposing candidates and facilitate the treatment of diseases.",Wei WQ; Yan C; Grabowska M; Dickson A; Li B; Wen Z; Roden D; Stein C; Embi P; Peterson J; Feng Q; Malin B 37461512,Leveraging Generative AI to Prioritize Drug Repurposing Candidates: Validating Identified Candidates for Alzheimer's Disease in Real-World Clinical Datasets.,2023,medRxiv : the preprint server for health sciences,,,,"Drug repurposing represents an attractive alternative to the costly and time-consuming process of new drug development, particularly for serious, widespread conditions with limited effective treatments, such as Alzheimer's disease (AD). Emerging generative artificial intelligence (GAI) technologies like ChatGPT offer the promise of expediting the review and summary of scientific knowledge. To examine the feasibility of using GAI for identifying drug repurposing candidates, we iteratively tasked ChatGPT with proposing the twenty most promising drugs for repurposing in AD, and tested the top ten for risk of incident AD in exposed and unexposed individuals over age 65 in two large clinical datasets: 1) Vanderbilt University Medical Center and 2) the All of Us Research Program. Among the candidates suggested by ChatGPT, metformin, simvastatin, and losartan were associated with lower AD risk in meta-analysis. 
These findings suggest GAI technologies can assimilate scientific insights from an extensive Internet-based search space, helping to prioritize drug repurposing candidates and facilitate the treatment of diseases.",Yan C; Grabowska ME; Dickson AL; Li B; Wen Z; Roden DM; Stein CM; Embi PJ; Peterson JF; Feng Q; Malin BA; Wei WQ 39359001,Evaluating the capability of ChatGPT in predicting drug-drug interactions: Real-world evidence using hospitalized patient data.,2024,British journal of clinical pharmacology,,,,"Drug-drug interactions (DDIs) present a significant health burden, compounded by clinician time constraints and poor patient health literacy. We assessed the ability of ChatGPT (generative artificial intelligence-based large language model) to predict DDIs in a real-world setting. Demographics, diagnoses and prescribed medicines for 120 hospitalized patients were input through three standardized prompts to ChatGPT version 3.5 and compared against pharmacist DDI evaluation to estimate diagnostic accuracy. Area under receiver operating characteristic and inter-rater reliability (Cohen's and Fleiss' kappa coefficients) were calculated. ChatGPT's responses differed based on prompt wording style, with higher sensitivity for prompts mentioning 'drug interaction'. Confusion matrices displayed low true positive and high true negative rates, and there was minimal agreement between ChatGPT and pharmacists (Cohen's kappa values 0.077-0.143). 
Low sensitivity values suggest a lack of success in identifying DDIs by ChatGPT, and further development is required before it can reliably assess potential DDIs in real-world scenarios.",Radha Krishnan RP; Hung EH; Ashford M; Edillo CE; Gardner C; Hatrick HB; Kim B; Lai AWY; Li X; Zhao YX; Raubenheimer JE 39190012,Toward an Explainable Large Language Model for the Automatic Identification of the Drug-Induced Liver Injury Literature.,2024,Chemical research in toxicology,,,,"Drug-induced liver injury (DILI) stands as a significant concern in drug safety, representing the primary cause of acute liver failure. Identifying the scientific literature related to DILI is crucial for monitoring, investigating, and conducting meta-analyses of drug safety issues. Given the intricate and often obscure nature of drug interactions, simple keyword searching can be insufficient for the exhaustive retrieval of the DILI-relevant literature. Manual curation of DILI-related publications demands pharmaceutical expertise and is susceptible to errors, severely limiting throughput. Despite numerous efforts utilizing cutting-edge natural language processing and deep learning techniques to automatically identify the DILI-related literature, their performance remains suboptimal for real-world applications in clinical research and regulatory contexts. In the past year, large language models (LLMs) such as ChatGPT and its open-source counterpart LLaMA have achieved groundbreaking progress in natural language understanding and question answering, paving the way for the automated, high-throughput identification of the DILI-related literature and subsequent analysis. Leveraging a large-scale public dataset comprising 14 203 training publications from the CAMDA 2022 literature AI challenge, we have developed what we believe to be the first LLM specialized in DILI analysis based on LLaMA-2. 
In comparison with other smaller language models such as BERT, GPT, and their variants, LLaMA-2 exhibits an enhanced out-of-fold accuracy of 97.19% and area under the ROC curve of 0.9947 using 3-fold cross-validation on the training set. Despite LLMs' initial design for dialogue systems, our study illustrates their successful adaptation into accurate classifiers for automated identification of the DILI-related literature from vast collections of documents. This work is a step toward unleashing the potential of LLMs in the context of regulatory science and facilitating the regulatory review process.",Ma C; Wolfinger RD 40114317,Application of Large Language Models in Drug-Induced Osteotoxicity Prediction.,2025,Journal of chemical information and modeling,,,,"Drug-induced osteotoxicity refers to the harmful effects certain drugs have on the skeletal system, posing significant safety risks. These toxic effects are a key concern in clinical practice, drug development, and environmental management. However, existing toxicity assessment models lack specialized data sets and algorithms for predicting osteotoxicity. In our study, we collected osteotoxic molecules and employed various large language models, including DeepSeek and ChatGPT, alongside traditional machine learning methods to predict their properties. Among these, the DeepSeek R1 and ChatGPT o3 models achieved ACC values of 0.87 and 0.88, respectively. Our results indicate that machine learning methods can assist in evaluating the impact of harmful substances on bone health during drug development, improving safety protocols, mitigating skeletal side effects, and enhancing treatment outcomes and public safety. 
Furthermore, it highlights the potential of large language models in predicting molecular toxicity and their significance in the fields of health and chemical sciences.",Chen YQ; Yu T; Song ZQ; Wang CY; Luo JT; Xiao Y; Qiu H; Wang QQ; Jin HM 39593819,Artificial Intelligence Diagnosing of Oral Lichen Planus: A Comparative Study.,2024,"Bioengineering (Basel, Switzerland)",,,,"Early diagnosis of oral lichen planus (OLP) is challenging, as it traditionally depends on clinical experience and subjective interpretation. Artificial intelligence (AI) technology has been widely applied in objective and rapid diagnoses. In this study, we aim to investigate the potential of AI diagnosis in OLP and evaluate its effectiveness in improving diagnostic accuracy and accelerating clinical decision making. A total of 128 confirmed OLP patients were included, and lesion images from various anatomical sites were collected. The diagnosis was performed using AI platforms, including ChatGPT-4o, ChatGPT (Diagram-Date extension), and Claude Opus, for direct AI identification and pre-trained AI identification. After OLP feature training, the diagnostic accuracy of the AI platforms significantly improved, with the overall recognition rates of ChatGPT-4o, ChatGPT (Diagram-Date extension), and Claude Opus increasing from 59%, 68%, and 15% to 77%, 80%, and 50%, respectively. Additionally, the pre-training recognition rates for buccal mucosa reached 94%, 93%, and 56%, respectively. However, the AI platforms performed less effectively when recognizing lesions in less common sites and complex cases; for instance, the pre-training recognition rates for the gums were only 60%, 60%, and 20%, demonstrating significant limitations.
The study highlights the strengths and limitations of different AI technologies and provides a reference for future AI applications in oral medicine.",Yu S; Sun W; Mi D; Jin S; Wu X; Xin B; Zhang H; Wang Y; Sun X; He X 38580746,Scientific figures interpreted by ChatGPT: strengths in plot recognition and limits in color perception.,2024,NPJ precision oncology,,,,"Emerging studies underscore the promising capabilities of large language model-based chatbots in conducting basic bioinformatics data analyses. The recent feature of accepting image inputs by ChatGPT, also known as GPT-4V(ision), motivated us to explore its efficacy in deciphering bioinformatics scientific figures. Our evaluation with examples in cancer research, including sequencing data analysis, multimodal network-based drug repositioning, and tumor clonal evolution, revealed that ChatGPT can proficiently explain different plot types and apply biological knowledge to enrich interpretations. However, it struggled to provide accurate interpretations when color perception and quantitative analysis of visual elements were involved. Furthermore, while the chatbot can draft figure legends and summarize findings from the figures, stringent proofreading is imperative to ensure the accuracy and reliability of the content.",Wang J; Ye Q; Liu L; Guo NL; Hu G 37904927,"Bioinformatics Illustrations Decoded by ChatGPT: The Good, The Bad, and The Ugly.",2023,bioRxiv : the preprint server for biology,,,,"Emerging studies underscore the promising capabilities of large language model-based chatbots in conducting fundamental bioinformatics data analyses. The recent feature of accepting image-inputs by ChatGPT motivated us to explore its efficacy in deciphering bioinformatics illustrations. 
Our evaluation with examples in cancer research, including sequencing data analysis, multimodal network-based drug repositioning, and tumor clonal evolution, revealed that ChatGPT can proficiently explain different plot types and apply biological knowledge to enrich interpretations. However, it struggled to provide accurate interpretations when quantitative analysis of visual elements was involved. Furthermore, while the chatbot can draft figure legends and summarize findings from the figures, stringent proofreading is imperative to ensure the accuracy and reliability of the content.",Wang J; Ye Q; Liu L; Lan Guo N; Hu G 37866949,[Reflections on the Implications of the Developments in ChatGPT for Changes in Medical Education Models].,2023,Sichuan da xue xue bao. Yi xue ban = Journal of Sichuan University. Medical science edition,,,,"Ever since its official launch, Chat Generative Pre-Trained Transformer, or ChatGPT, a natural language processing tool driven by artificial intelligence (AI) technology, has attracted much attention from the education community. ChatGPT can play an important role in the field of medical education, with its potential applications ranging from assisting teachers in designing individualized teaching scenarios to enhancing students' practical ability for solving clinical problems and improving teaching and research efficiency. With the developments in technology, it is inevitable that ChatGPT, or other generative AI models, will be thoroughly integrated in more and more medical contexts, which will further enhance the efficiency and quality of medical services and allow doctors to spend more time interacting with patients and implement personalized health management. 
Herein, we suggest that proactive reflection be undertaken to figure out the best way to cultivate health professionals in the context of New Medical Education, to help more medical professionals enhance their understanding of developments in artificial intelligence, and to make preparations for the challenges that will emerge in the new round of technological revolution. Medical educators should focus on guiding students to make proper use of AI tools in the appropriate context, thereby preventing abuse or overreliance caused by a lack of discriminating ability. Teachers should focus on helping medical students make improvements in clinical reasoning skills, self-directed learning, and clinical practical skills. Teachers should stress the importance for medical students of understanding the philosophical implications of the mind-body unity concept, holistic medical thinking, and systematic medical thinking. It is important to enhance medical students' humanistic qualities, cultivate their empathy and communication skills, and continually enhance their ability to meet the requirements of individualized precision diagnosis and treatment so that they will better adapt to future developments in medicine.",Qu X; Yang J; Chen T; Zhang W 36960451,AI-generated research paper fabrication and plagiarism in the scientific community.,2023,"Patterns (New York, N.Y.)",,,,Fabricating research within the scientific community has consequences for one's credibility and undermines honest authors. We demonstrate the feasibility of fabricating research using an AI-based language model chatbot. Human detection versus AI detection will be compared to determine accuracy in identifying fabricated works.
The risks of utilizing AI-generated research works will be underscored and the reasons for falsifying research will be highlighted.,Elali FR; Rachid LN 37918623,Leveraging GPT-4 for food effect summarization to enhance product-specific guidance development via iterative prompting.,2023,Journal of biomedical informatics,,,,"Food effect summarization from a New Drug Application (NDA) is an essential component of product-specific guidance (PSG) development and assessment, which provides the basis of recommendations for fasting and fed bioequivalence studies to guide the pharmaceutical industry in developing generic drug products. However, manual summarization of food effects from extensive drug application review documents is time-consuming. Therefore, there is a need to develop automated methods to generate food effect summaries. Recent advances in natural language processing (NLP), particularly large language models (LLMs) such as ChatGPT and GPT-4, have demonstrated great potential in improving the effectiveness of automated text summarization, but their accuracy in summarizing food effects for PSG assessment remains unclear. In this study, we introduce a simple yet effective approach, iterative prompting, which allows one to interact with ChatGPT or GPT-4 more effectively and efficiently through multi-turn interaction. Specifically, we propose a three-turn iterative prompting approach to food effect summarization in which keyword-focused and length-controlled prompts are respectively provided in consecutive turns to refine the quality of the generated summary. We conduct a series of extensive evaluations, ranging from automated metrics to FDA professionals and even evaluation by GPT-4, on 100 NDA review documents selected over the past five years. We observe that the summary quality is progressively improved throughout the iterative prompting process.
Moreover, we find that GPT-4 performs better than ChatGPT, as evaluated by FDA professionals (43% vs. 12%) and GPT-4 (64% vs. 35%). Importantly, all the FDA professionals unanimously rated that 85% of the summaries generated by GPT-4 are factually consistent with the golden reference summary, a finding further supported by GPT-4 rating of 72% consistency. Taken together, these results strongly suggest a great potential for GPT-4 to draft food effect summaries that could be reviewed by FDA professionals, thereby improving the efficiency of the PSG assessment cycle and promoting generic drug product development.",Shi Y; Ren P; Wang J; Han B; ValizadehAslani T; Agbavor F; Zhang Y; Hu M; Zhao L; Liang H 39216679,"Editorial Commentary: Large Language Models Like ChatGPT Show Promise, but Clinical Use of Artificial Intelligence Requires Physician Partnership.",2025,Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association,,,,"Forcing ChatGPT and other large language models to perform roles reserved for physicians and other health care professionals-namely evaluation, management, and triage-poses a threat from regulatory, risk management, and professional perspectives. The clinical practice of medicine would benefit tremendously from automated administrative support with systems-based transparency and fluidity-not substitution for clinical diagnostics and decision making. ChatGPT and other large language models are not intended or authorized for clinical use, let alone to be tested or rubber stamped for this application. 
The best clinical use cases of artificial intelligence require physician partnership to enable personal care, minimize administrative burden, maximize efficiency, and minimize risk-without substitution of core physician tasks.",Ramkumar PN; Woo JJ 38715436,"The new paradigm in machine learning - foundation models, large language models and beyond: a primer for physicians.",2024,Internal medicine journal,,,,"Foundation machine learning models are deep learning models capable of performing many different tasks using different data modalities such as text, audio, images and video. They represent a major shift from traditional task-specific machine learning prediction models. Large language models (LLM), brought to wide public prominence in the form of ChatGPT, are text-based foundational models that have the potential to transform medicine by enabling automation of a range of tasks, including writing discharge summaries, answering patients questions and assisting in clinical decision-making. However, such models are not without risk and can potentially cause harm if their development, evaluation and use are devoid of proper scrutiny. This narrative review describes the different types of LLM, their emerging applications and potential limitations and bias and likely future translation into clinical practice.",Scott IA; Zuccon G 37893850,Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration.,2023,"Healthcare (Basel, Switzerland)",,,,"Generative artificial intelligence (AI) and large language models (LLMs), exemplified by ChatGPT, are promising for revolutionizing data and information management in healthcare and medicine. However, there is scant literature guiding their integration for non-AI professionals. This study conducts a scoping literature review to address the critical need for guidance on integrating generative AI and LLMs into healthcare and medical practices. 
It elucidates the distinct mechanisms underpinning these technologies, such as Reinforcement Learning from Human Feedback (RLHF), as well as few-shot learning and chain-of-thought reasoning, which differentiate them from traditional, rule-based AI systems. Achieving these benefits requires an inclusive, collaborative co-design process that engages all pertinent stakeholders, including clinicians and consumers. Although global research is examining both opportunities and challenges, including ethical and legal dimensions, LLMs offer promising advancements in healthcare by enhancing data management, information retrieval, and decision-making processes. Continued innovation in data acquisition, model fine-tuning, prompt strategy development, evaluation, and system implementation is imperative for realizing the full potential of these technologies. Organizations should proactively engage with these technologies to improve healthcare quality, safety, and efficiency, adhering to ethical and legal guidelines for responsible application.",Yu P; Xu H; Hu X; Deng C 38762072,Generative artificial intelligence in ophthalmology.,2025,Survey of ophthalmology,,,,"Generative artificial intelligence (AI) has revolutionized medicine over the past several years. A generative adversarial network (GAN) is a deep learning framework that has become a powerful technique in medicine, particularly in ophthalmology for image analysis. In this paper we review the current ophthalmic literature involving GANs and highlight key contributions in the field. We briefly touch on ChatGPT, another application of generative AI, and its potential in ophthalmology. We also explore the potential uses for GANs in ocular imaging, with a specific emphasis on 3 primary domains: image enhancement, disease identification, and generation of synthetic data. PubMed, Ovid MEDLINE, and Google Scholar were searched from inception to October 30, 2022, to identify applications of GAN in ophthalmology.
A total of 40 papers were included in this review. We cover various applications of GANs in ophthalmic-related imaging including optical coherence tomography, orbital magnetic resonance imaging, fundus photography, and ultrasound; however, we also highlight several challenges that resulted in the generation of inaccurate and atypical results during certain iterations. Finally, we examine future directions and considerations for generative AI in ophthalmology.",Waisberg E; Ong J; Kamran SA; Masalkhi M; Paladugu P; Zaman N; Lee AG; Tavakkoli A 39059040,Evaluating generative AI responses to real-world drug-related questions.,2024,Psychiatry research,,,,"Generative Artificial Intelligence (AI) systems such as OpenAI's ChatGPT, capable of an unprecedented ability to generate human-like text and converse in real time, hold potential for large-scale deployment in clinical settings such as substance use treatment. Treatment for substance use disorders (SUDs) is particularly high stakes, requiring evidence-based clinical treatment, mental health expertise, and peer support. Thus, promises of AI systems addressing deficient healthcare resources and structural bias are relevant within this domain, especially in an anonymous setting. This study explores the effectiveness of generative AI in answering real-world substance use and recovery questions. We collect questions from online recovery forums, use ChatGPT and Meta's LLaMA-2 for responses, and have SUD clinicians rate these AI responses. While clinicians rated the AI-generated responses as high quality, we discovered instances of dangerous disinformation, including disregard for suicidal ideation, incorrect emergency helplines, and endorsement of home detox. Moreover, the AI systems produced inconsistent advice depending on question phrasing. These findings indicate a risky mix of seemingly high-quality, accurate responses upon initial inspection that contain inaccurate and potentially deadly medical advice. 
Consequently, while generative AI shows promise, its real-world application in sensitive healthcare domains necessitates further safeguards and clinical validation.",Giorgi S; Isman K; Liu T; Fried Z; Sedoc J; Curtis B 39974299,DeepSeek in Healthcare: Revealing Opportunities and Steering Challenges of a New Open-Source Artificial Intelligence Frontier.,2025,Cureus,,,,"Generative Artificial Intelligence (GAI) has driven several advancements in healthcare, with large language models (LLMs) such as OpenAI's ChatGPT, Google's Gemini, and Microsoft's Copilot demonstrating potential in clinical decision support, medical education, and research acceleration. However, their closed-source architecture, high computational costs, and limited adaptability to specialized medical contexts remained key barriers to universal adoption. Now, with the rise of DeepSeek's DeepThink (R1), an open-source LLM, gaining prominence since mid-January 2025, new opportunities and challenges emerge for healthcare integration and AI-driven research. Unlike proprietary models, DeepSeek fosters continuous learning by leveraging publicly available open-source datasets, possibly enhancing adaptability to the ever-evolving medical knowledge and scientific reasoning. Its transparent, community-driven approach may enable greater customization, regional specialization, and collaboration among data researchers and clinicians. Additionally, DeepSeek supports offline deployment, addressing some data privacy concerns. Despite these promising advantages, DeepSeek presents ethical and regulatory challenges. Users' data privacy worries have emerged, with concerns about user data retention policies and potential developer access to user-generated content without opt-out options. Additionally, when used in healthcare applications, its compliance with China's data-sharing regulations highlights the urgent need for clear international data privacy and governance. 
Furthermore, like other LLMs, DeepSeek may face limitations related to inherent biases, hallucinations, and output reliability, which warrants rigorous validation and human oversight before clinical application. This editorial explores DeepSeek's potential role in clinical workflows, medical education, and research while also highlighting its challenges related to security, accuracy, and responsible AI governance. With careful implementation, ethical considerations, and international collaboration, DeepSeek and similar LLMs could enhance healthcare innovation, providing cost-effective, scalable AI solutions while ensuring human expertise remains at the forefront of patient care.",Temsah A; Alhasan K; Altamimi I; Jamal A; Al-Eyadhy A; Malki KH; Temsah MH 39820845,Generative Artificial Intelligence Use in Healthcare: Opportunities for Clinical Excellence and Administrative Efficiency.,2025,Journal of medical systems,,,,"Generative Artificial Intelligence (Gen AI) has transformative potential in healthcare to enhance patient care, personalize treatment options, train healthcare professionals, and advance medical research. This paper examines various clinical and non-clinical applications of Gen AI. In clinical settings, Gen AI supports the creation of customized treatment plans, generation of synthetic data, analysis of medical images, nursing workflow management, risk prediction, pandemic preparedness, and population health management. By automating administrative tasks such as medical documentations, Gen AI has the potential to reduce clinician burnout, freeing more time for direct patient care. Furthermore, application of Gen AI may enhance surgical outcomes by providing real-time feedback and automation of certain tasks in operating rooms. The generation of synthetic data opens new avenues for model training for diseases and simulation, enhancing research capabilities and improving predictive accuracy. 
In non-clinical contexts, Gen AI improves medical education, public relations, revenue cycle management, healthcare marketing etc. Its capacity for continuous learning and adaptation enables it to drive ongoing improvements in clinical and operational efficiencies, making healthcare delivery more proactive, predictive, and precise.",Bhuyan SS; Sateesh V; Mukul N; Galvankar A; Mahmood A; Nauman M; Rai A; Bordoloi K; Basu U; Samuel J 39563479,Mapping the Landscape of Generative Language Models in Dental Education: A Comparison Between ChatGPT and Google Bard.,2025,European journal of dental education : official journal of the Association for Dental Education in Europe,,,,"Generative language models (LLMs) have shown great potential in various fields, including medicine and education. This study evaluated and compared ChatGPT 3.5 and Google Bard within dental education and research. METHODS: We developed seven dental education-related queries to assess each model across various domains: their role in dental education, creation of specific exercises, simulations of dental problems with treatment options, development of assessment tools, proficiency in dental literature and their ability to identify, summarise and critique a specific article. Two blind reviewers scored the responses using defined metrics. The means and standard deviations of the scores were reported, and differences between the scores were analysed using Wilcoxon tests. RESULTS: ChatGPT 3.5 outperformed Bard in several tasks, including the ability to create highly comprehensive, accurate, clear, relevant and specific exercises on dental concepts, generate simulations of dental problems with treatment options and develop assessment tools. On the other hand, Bard was successful in retrieving real research, and it was able to critique the article it selected. 
Statistically significant differences were noted between the average scores of the two models (p 7 towards the amyloid precursor protein, with 100% of the generated molecules being valid and novel. This demonstrated the applicability of LCMs in drug discovery, with benefits including less data consumption while fine-tuning. The applicability of LCMs to drug discovery opens the door for larger studies involving reinforcement-learning with human feedback, where chemists provide feedback to LCMs and generate higher-quality molecules. LCMs' ability to design similar molecules from datasets paves the way for more accessible, non-patented alternatives to drug molecules.",Ye G 38277743,Navigating the path to precision: ChatGPT as a tool in pathology.,2024,"Pathology, research and practice",,,,"In recent years, the integration of Artificial Intelligence (AI) into medicine has marked a transformative shift in healthcare practices. This study explores the application of ChatGPT 3.5, an AI-based natural language processing model, in the field of pathology, with a focus on Clinical Pathology, Histopathology, and Hematology. Leveraging a dataset of 30 clinical cases from an online source, the model's performance was evaluated, revealing moderate proficiency in data analysis and decision support. While ChatGPT demonstrated strengths in swift narrative comprehension and foundational insights, limitations were observed in generating detailed and comprehensive information. 
The study emphasizes the evolving nature of AI in pathology, highlighting the need for ongoing refinement and collaborative efforts between AI researchers and healthcare professionals.",Vaidyanathaiyer R; Thanigaimani GD; Arumugam P; Einstien D; Ganesan S; Surapaneni KM 40413583,[ARTIFICIAL INTELLIGENCE TOOLS AND THEIR USE IN MEDICINE CHATGPT - NOT THE ONLY PLAYER IN THE ARENA].,2025,Harefuah,,,,"In recent years, there has been a remarkable growth in the development and use of artificial intelligence tools in medicine based on large language models. This review will describe the main existing tools and their various applications for medical staff and patients. Despite its popularity, we will show that ChatGPT is not the only tool and that other tools are sometimes preferable. We will review research comparisons between different tools' effectiveness in various tasks. It will be shown that these tools still fall short in specific aspects of performance, such as accuracy and reliability in providing information, understanding clinical context, and making diagnoses. The number of studies on these topics is small, and their reported results sometimes contradict each other. Additional quality research is needed to characterize and improve these tools and designate specific tools for different medical uses. Despite the many advantages and enormous potential inherent in these models, they should be used cautiously, as they only aid the treating physician and do not replace his knowledge, professional experience, and human judgment.",Weizman Z; Degany O; Shoenfeld Y 38903157,Exploring ChatGPT's potential in the clinical stream of neurorehabilitation.,2024,Frontiers in artificial intelligence,,,,"In several medical fields, generative AI tools such as ChatGPT have achieved optimal performance in identifying correct diagnoses only by evaluating narrative clinical descriptions of cases. 
The most active fields of application include oncology and COVID-19-related symptoms, with preliminary relevant results also in psychiatric and neurological domains. This scoping review aims to introduce the arrival of ChatGPT applications in neurorehabilitation practice, where such AI-driven solutions have the potential to revolutionize patient care and assistance. First, a comprehensive overview of ChatGPT, including its design and potential applications in medicine, is provided. Second, the remarkable natural language processing skills and limitations of these models are examined with a focus on their use in neurorehabilitation. In this context, we present two case scenarios to evaluate ChatGPT's ability to resolve higher-order clinical reasoning. Overall, we provide support for the first evidence that generative AI can meaningfully integrate as a facilitator into neurorehabilitation practice, aiding physicians in defining increasingly efficacious diagnostic and personalized prognostic plans.",Maggio MG; Tartarisco G; Cardile D; Bonanno M; Bruschetta R; Pignolo L; Pioggia G; Calabro RS; Cerasa A 38060759,Be or Not to Be With ChatGPT?,2023,Cureus,,,,"In the ever-evolving realm of scientific research, this letter underscores the vital role of ChatGPT as an invaluable ally in manuscript creation, focusing on its remarkable grammar and spelling error correction capabilities. Furthermore, it highlights ChatGPT's efficacy in expediting the manuscript preparation process by streamlining the collection and highlighting of critical scientific information. 
By elucidating the aim of this letter and the multifaceted benefits of ChatGPT, we aspire to illuminate the path toward a future where scientific writing achieves unparalleled efficiency and precision.",Aliyeva A; Sari E 40183181,Current applications and future perspectives of artificial intelligence in functional urology and neurourology: how far can we get?,2025,Minerva urology and nephrology,,,,"In the last few years, the scientific community has seen an increasing interest towards the potential applications of artificial intelligence in medicine and healthcare. In this context, urology represents an area of rapid development, particularly in uro-oncology, where a wide range of applications has focused on prostate cancer diagnosis. Other urological branches are also starting to explore the potential advantages of AI in the diagnostic and therapeutic process, and functional urology and neurourology are among them. Although the experiences in this sense have been quite limited so far, some AI applications have already started to show potential benefits, especially for urodynamic and imaging interpretation, as well as for the development of AI-based predictive models for treatment response. A few experiences on the use of ChatGPT to answer questions on functional urology and neurourology topics have also been reported. Conversely, AI applications in functional urology surgery remain largely unexplored. 
This paper provides a critical overview of the current evidence on this topic, highlighting the potential benefits for the diagnostic workflow, therapeutic evaluation and surgical training, as well as the current limitations that need to be addressed to enable the integration of these tools into clinical practice in the future.",Gallo ML; Moriconi M; Phe V 37440696,What Should ChatGPT Mean for Bioethics?,2023,The American journal of bioethics : AJOB,,,,"In the last several months, several major disciplines have started their initial reckoning with what ChatGPT and other Large Language Models (LLMs) mean for them - law, medicine, business among other professions. With a heavy dose of humility, given how fast the technology is moving and how uncertain its social implications are, this article attempts to give some early tentative thoughts on what ChatGPT might mean for bioethics. I will first argue that many bioethics issues raised by ChatGPT are similar to those raised by current medical AI - built into devices, decision support tools, data analytics, etc. These include issues of data ownership, consent for data use, data representativeness and bias, and privacy. I describe how these familiar issues appear somewhat differently in the ChatGPT context, but much of the existing bioethical thinking on these issues provides a strong starting point. There are, however, a few ""new-ish"" issues I highlight - by new-ish I mean issues that while perhaps not truly new seem much more important for it than other forms of medical AI. These include issues about informed consent and the right to know we are dealing with an AI, the problem of medical deepfakes, the risk of oligopoly and inequitable access related to foundational models, environmental effects, and on the positive side opportunities for the democratization of knowledge and empowering patients. I also discuss how races towards dominance (between large companies and between the U.S. 
and geopolitical rivals like China) risk sidelining ethics.",Cohen IG 39198112,Large language model application in emergency medicine and critical care.,2024,Journal of the Formosan Medical Association = Taiwan yi zhi,,,,"In the rapidly evolving healthcare landscape, artificial intelligence (AI), particularly the large language models (LLMs), like OpenAI's Chat Generative Pretrained Transformer (ChatGPT), has shown transformative potential in emergency medicine and critical care. This review article highlights the advancement and applications of ChatGPT, from diagnostic assistance to clinical documentation and patient communication, demonstrating its ability to perform comparably to human professionals in medical examinations. ChatGPT could assist clinical decision-making and medication selection in critical care, showcasing its potential to optimize patient care management. However, integrating LLMs into healthcare raises legal, ethical, and privacy concerns, including data protection and the necessity for informed consent. Finally, we addressed the challenges related to the accuracy of LLMs, such as the risk of providing incorrect medical advice. These concerns underscore the importance of ongoing research and regulation to ensure their ethical and practical use in healthcare.",Hwai H; Ho YJ; Wang CH; Huang CH 38887420,Assessing Risk of Bias Using ChatGPT-4 and Cochrane ROB2 Tool.,2024,Medical science educator,,,,"In the world of evidence-based medicine, systematic reviews have long been the gold standard. But they have had a problem-they take forever. That is where ChatGPT-4 and automation come in. They are like a breath of fresh air, speeding things up and making the process more reliable. ChatGPT-4 is like having a super-smart assistant who can quickly assess bias risk in research studies. It is a game-changer, especially in a field where getting the latest research quickly can mean life or death for patients. 
Sure, it is not perfect, and we still need humans to keep an eye on things and ensure everything's ethical. But the future looks bright. With ChatGPT-4 and automation, evidence-based medicine is on the fast track to success.",Trevino-Juarez AS 38313631,Use of artificial intelligence in the field of pain medicine.,2024,World journal of clinical cases,,,,"In this editorial we comment on the article ""Potential and limitations of ChatGPT and generative artificial intelligence in medical safety education"" published in the recent issue of the World Journal of Clinical Cases. This article described the usefulness of artificial intelligence (AI) in medical safety education. Herein, we focus specifically on the use of AI in the field of pain medicine. AI technology has emerged as a powerful tool, and is expected to play an important role in the healthcare sector and significantly contribute to pain medicine as further developments are made. AI may have several applications in pain medicine. First, AI can assist in selecting testing methods to identify causes of pain and improve diagnostic accuracy. Entry of a patient's symptoms into the algorithm can prompt it to suggest necessary tests and possible diagnoses. Based on the latest medical information and recent research results, AI can support doctors in making accurate diagnoses and setting up an effective treatment plan. Second, AI assists in interpreting medical images. For neural and musculoskeletal disorders, imaging tests are of vital importance. AI can analyze a variety of imaging data, including that from radiography, computed tomography, and magnetic resonance imaging, to identify specific patterns, allowing quick and accurate image interpretation. Third, AI can predict the outcomes of pain treatments, contributing to setting up the optimal treatment plan. 
By predicting individual patient responses to treatment, AI algorithms can assist doctors in establishing a treatment plan tailored to each patient, further enhancing treatment effectiveness. For efficient utilization of AI in the pain medicine field, it is crucial to enhance the accuracy of AI decision-making by using more medical data, while issues related to the protection of patient personal information and responsibility for AI decisions will have to be addressed. In the future, AI technology is expected to be innovatively applied in the field of pain medicine. The advancement of AI is anticipated to have a positive impact on the entire medical field by providing patients with accurate and effective medical services.",Chang MC 37041067,ChatGPT: when artificial intelligence replaces the rheumatologist in medical writing.,2023,Annals of the rheumatic diseases,,,,"In this editorial we discuss the place of artificial intelligence (AI) in the writing of scientific articles and especially editorials. We asked chatGPT ""to write an editorial for Annals of Rheumatic Diseases about how AI may replace the rheumatologist in editorial writing"". chatGPT's response is diplomatic and describes AI as a tool to help the rheumatologist but not replace him. AI is already used in medicine, especially in image analysis, but the domains are infinite and it is possible that AI could quickly help or replace rheumatologists in the writing of scientific articles. We discuss the ethical aspects and the future role of rheumatologists.",Verhoeven F; Wendling D; Prati C 38155290,Dr. 
GAI: Significance of Generative AI in Plastic Surgery.,2025,Aesthetic plastic surgery,,,,"In this letter to the editor, I offer a critique of the article titled ""Consulting the Digital Doctor: Google Versus ChatGPT as Sources of Information on Breast Implant-Associated Anaplastic Large Cell Lymphoma and Breast Implant Illness."" While acknowledging the authors' pioneering effort to compare informational outputs from Google and a generative AI (GAI)-ChatGPT, I raise concerns about the methodology, lack of rigorous validation, potential biases, and the overstatement of findings. The letter suggests that the authors' conclusions about the superiority of ChatGPT in providing high-quality medical information may be premature, given the limitations of the study design and the evolving nature of artificial intelligence (AI) technology.No Level Assigned This journal requires that authors assign a level of evidence to each submission to which Evidence-Based Medicine rankings are applicable. This excludes Review Articles, Book Reviews, and manuscripts that concern Basic Science, Animal Studies, Cadaver Studies, and Experimental Studies. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.",Ray PP 38849691,The Future Role of Radiologists in the Artificial Intelligence-Driven Hospital.,2024,Annals of biomedical engineering,,,,"Increasing population and healthcare costs make changes in the healthcare system necessary. This article deals with ChatGPT's perspective on the future role of radiologists in the AI-driven hospital. This perspective will be augmented by further considerations by the author. AI-based imaging technologies and chatbots like ChatGPT can help improve radiologists' performance and workflow in the future AI-driven hospital. 
Although basic radiological examinations could be delivered without needing a radiologist, sophisticated imaging procedures will still need the expert opinion of a radiologist.",Sedaghat S 38314912,ChIP-GPT: a managed large language model for robust data extraction from biomedical database records.,2024,Briefings in bioinformatics,,,,"Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors-a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. 
Our proposed method is easily adaptable to customized questions and different databases.",Cinquin O 38082605,ChatGPT for phenotypes extraction: one model to rule them all?,2023,Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference,,,,"Information Extraction (IE) is a core task in Natural Language Processing (NLP) where the objective is to identify factual knowledge in textual documents (often unstructured), and feed downstream use cases with the resulting output. In genomic medicine for instance, being able to extract the most precise list of phenotypes associated with a patient makes it possible to improve genetic disease diagnosis, which represents a vital step in the modern deep phenotyping approach. As most of the phenotypic information lies in clinical reports, the challenge is to build an IE pipeline to automatically recognize phenotype concepts from free-text notes. A new machine learning paradigm around large language models (LLM) has given rise to an increasing number of academic works on this topic lately, where sophisticated combinations of different techniques have been employed to improve phenotype extraction accuracy. Even more recently released, the ChatGPT(1) application nevertheless raises the question of the relevance of these approaches compared to this new generic one based on an instruction-oriented LLM. In this paper, we propose a rigorous evaluation of ChatGPT and the current state-of-the-art solutions on this specific task, and discuss the possible impacts and the technical evolutions to consider in the medical domain. Clinical relevance: Deep phenotyping on electronic health records has proven its ability to improve genetic diagnosis by clinical exomes [10]. 
Thus, comparing state-of-the-art solutions in order to derive insights and improve research paths is essential.",Labbe T; Castel P; Sanner JM; Saleh M 39561352,Evaluating the concordance of ChatGPT and physician recommendations for bariatric surgery.,2025,Canadian journal of physiology and pharmacology,,,,"Integrating artificial intelligence (AI) into healthcare prompts the need to measure its proficiency relative to human experts. This study evaluates the proficiency of ChatGPT, an OpenAI language model, in offering guidance concerning bariatric surgery compared to bariatric surgeons. Five clinical scenarios representative of diverse bariatric surgery situations were given to American Society for Metabolic and Bariatric Surgery (ASMBS)-accredited bariatric surgeons and ChatGPT. Both groups proposed medical or surgical management for the patients depicted in each scenario. The outcomes from both the surgeons and ChatGPT were examined and matched with the clinical benchmarks set by the ASMBS. There was a high degree of agreement between ChatGPT and physicians on the three simpler clinical scenarios. There was a positive correlation between physicians' and ChatGPT's answers for not recommending surgery. ChatGPT's advice aligned with ASMBS guidelines 60% of the time, in contrast to bariatric surgeons, who consistently aligned with the guidelines 100% of the time. ChatGPT showcases potential in offering guidance on bariatric surgery, but it does not have the comprehensive and personalized perspective that doctors exhibit consistently. 
Enhancing AI's training on intricate patient situations will bolster its role in the medical field.",Kahlon S; Sleet M; Sujka J; Docimo S; DuCoin C; Dimou F; Mhaskar R 38657295,Comparison of ChatGPT version 3.5 & 4 for utility in respiratory medicine education using clinical case scenarios.,2024,Respiratory medicine and research,,,,"Integration of ChatGPT in Respiratory medicine presents a promising avenue for enhancing clinical practice and pedagogical approaches. This study compares the performance of ChatGPT version 3.5 and 4 in respiratory medicine, emphasizing its potential in clinical decision support and medical education using clinical cases. Results indicate moderate performance highlighting limitations in handling complex case scenarios. Compared to ChatGPT 3.5, version 4 showed greater promise as a pedagogical tool, providing interactive learning experiences. While serving as a preliminary decision support tool clinically, caution is advised, stressing the need for ongoing validation. Future research should refine its clinical capabilities for optimal integration into medical education and practice.",Balasanjeevi G; Surapaneni KM 37336616,How large language models can augment perioperative medicine: a daring discourse.,2023,Regional anesthesia and pain medicine,,,,"Interest in natural language processing, specifically large language models, for clinical applications has exploded in a matter of several months since the introduction of ChatGPT. Large language models are powerful and impressive. It is important that we understand the strengths and limitations of this rapidly evolving technology so that we can brainstorm its future potential in perioperative medicine. In this daring discourse, we discuss the issues with these large language models and how we should proactively think about how to leverage these models into practice to improve patient care, rather than worry that it may take over clinical decision-making. 
We review three potential major areas in which it may be used to benefit perioperative medicine: (1) clinical decision support and surveillance tools, (2) improved aggregation and analysis of research data related to large retrospective studies and application in predictive modeling, and (3) optimized documentation for quality measurement, monitoring and billing compliance. These large language models are here to stay and, as perioperative providers, we can either adapt to this technology or be curtailed by those who learn to use it well.",Gabriel RA; Mariano ER; McAuley J; Wu CL 38599742,[ChatGPT is an above-average student at the Faculty of Medicine of the University of Zaragoza and an excellent collaborator in the development of teaching materials].,2024,Revista espanola de patologia : publicacion oficial de la Sociedad Espanola de Anatomia Patologica y de la Sociedad Espanola de Citologia,,,,"INTRODUCTION AND OBJECTIVE: Artificial intelligence is fully present in our lives. In education, the possibilities of its use are endless, both for students and teachers. MATERIAL AND METHODS: The capacity of ChatGPT was explored by having it solve multiple-choice questions based on the exam of the subject <> from the first sitting of the 2022-23 academic year. In addition to comparing its results with those of the rest of the students, the probable causes of incorrect answers were evaluated. Finally, its ability to formulate new test questions based on specific instructions was evaluated. RESULTS: ChatGPT correctly answered 47 out of 68 questions, achieving a grade higher than the course average and median. Most of the failed questions contain negative statements, using the words <>, <> or <> in their wording. After interacting with it, the program can recognize its mistake and change its initial response to the correct answer. Finally, ChatGPT can develop new questions based on a theoretical assumption or a specific clinical simulation. 
CONCLUSIONS: As teachers we are obliged to explore the uses of artificial intelligence and try to use it to our benefit. Carrying out tasks that involve significant time consumption, such as preparing multiple-choice questions for content evaluation, is a good example.",Cabanuz C; Garcia-Garcia M 40322390,Performance of Large Language Models (ChatGPT and Gemini Advanced) in Gastrointestinal Pathology and Clinical Review of Applications in Gastroenterology.,2025,Cureus,,,,"Introduction Artificial intelligence (AI) chatbots have been widely tested in their performance on various examinations, with limited data on their performance in clinical scenarios. The role of Chat Generative Pre-Trained Transformer (ChatGPT) (OpenAI, San Francisco, California, United States) and Gemini Advanced (Google LLC, Mountain View, California, United States) in multiple aspects of gastroenterology including answering patient questions, providing medical advice, and as tools to potentially assist healthcare providers has shown some promise, though associated with many limitations. We aimed to study the performance of ChatGPT-4.0, ChatGPT-3.5, and Gemini Advanced across 20 clinicopathologic scenarios in the unexplored realm of gastrointestinal pathology. Materials and methods Twenty clinicopathological scenarios in gastrointestinal pathology were provided to these three large language models. Two fellowship-trained pathologists independently assessed their responses, evaluating both the diagnostic accuracy and confidence of the models. The results were then compared using the chi-squared test. The study also evaluated each model's ability in four key areas, namely, (1) ability to provide differential diagnoses, (2) interpretation of immunohistochemical stains, (3) ability to deliver a concise final diagnosis, and (4) explanation provided for the thought process, using a five-point scoring system. The mean, median score ± standard deviation (SD), and interquartile ranges were calculated. 
A comparative analysis of these four parameters across ChatGPT-4.0, ChatGPT-3.5, and Gemini Advanced was conducted using the Mann-Whitney U test. A p-value of <0.05 was considered statistically significant. Other parameters evaluated were the ability to provide a tumor, node, and metastasis (TNM) stage and the incidence of pseudo-references ""hallucinations"" while citing reference material. Results Gemini Advanced (diagnostic accuracy: p=0.01; providing differential diagnosis: p=0.03) and ChatGPT-4.0 (interpretation of immunohistochemistry (IHC) stains: p=0.001; providing differential diagnosis: p=0.002) performed significantly better in certain realms than ChatGPT-3.5, indicating continuously improving data training sets. However, the mean performances of ChatGPT-4.0 and Gemini Advanced ranged between 3.0 and 3.7 and were at best classified as average. None of the models could provide the accurate TNM staging for these clinical scenarios, with 25-50% citing references that do not exist (hallucinations). Conclusion This study indicated that though these models are evolving, they need human supervision and definite improvements before being used in clinical medicine. This is the first study of its kind in gastrointestinal pathology to the best of our knowledge.",Jain S; Chakraborty B; Agarwal A; Sharma R 39564056,A Pilot Study of Medical Student Opinions on Large Language Models.,2024,Cureus,,,,"Introduction Artificial intelligence (AI) has long garnered significant interest in the medical field. Large language models (LLMs) have popularized the use of AI for the public through chatbots such as ChatGPT and have become an easily accessible and recognizable medical resource for medical students. Here, we investigate how medical students are currently utilizing LLM-based tools throughout medical education and examine medical student perception of these tools. 
Methods A cross-sectional survey was administered to current medical students at the University of Florida College of Medicine (UFCOM) in January 2024 discussing the utilization of AI and LLM tools and perspectives on the current and future role of AI in medicine. Results All 102 respondents reported having heard of LLM-based chatbots such as ChatGPT, Bard, Bing Chat, and Claude. Sixty-nine percent (69%; 70/102) of respondents reported having used them for medical-related purposes at least once a month. Seventy-seven point one percent (77.1%; 54/70) reported the information provided by them to be very accurate or somewhat accurate, and 80% (55/70) reported that they were likely to continue using them in their future medical practice. Those with some baseline understanding of and exposure to AI were 3.26 (p=0.020) and 4.30 (p=0.002) times more likely to have used an LLM-based chatbot, respectively, and 5.06 (p=0.021) and 3.38 (p=0.039) times more likely to cross-check information obtained from them, respectively, compared to those with little to no baseline understanding or exposure. Furthermore, those with some exposure to AI in medical school were 2.70 (p=0.039) and 4.61 (p=0.0004) times more likely to trust AI with clinical decision-making currently and in the next 5 years, respectively, than those with little to no exposure. Those who had used an LLM-based chatbot were 4.31 (p=0.019) times more likely to trust AI with clinical decision-making currently compared to those who had not used one. Conclusion LLM-based chatbots, such as ChatGPT, are not only making their way into the medical student repertoire of study resources but are also being utilized in the setting of patient care and research. Medical students who participated in the survey generally had a positive perception of LLM-based chatbots and reported they were likely to continue using them in the future. 
Previous AI knowledge and exposure correlated with more conscientious use of these tools such as cross-checking information. Combined with our finding that all respondents believed AI should be taught in the medical curriculum, our study highlights a key opportunity in medical education to acclimate medical students to AI now.",Xu AY; Piranio VS; Speakman S; Rosen CD; Lu S; Lamprecht C; Medina RE; Corrielus M; Griffin IT; Chatham CE; Abchee NJ; Stribling D; Huynh PB; Harrell H; Shickel B; Brennan M 38618358,Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level.,2024,Cureus,,,,"Introduction Artificial intelligence (AI) models using large language models (LLMs) and non-specific domains have gained attention for their innovative information processing. As AI advances, it's essential to regularly evaluate these tools' competency to maintain high standards, prevent errors or biases, and avoid flawed reasoning or misinformation that could harm patients or spread inaccuracies. Our study aimed to determine the performance of Chat Generative Pre-trained Transformer (ChatGPT) by OpenAI and Google BARD (BARD) in orthopedic surgery, assess performance based on question types, contrast performance between different AIs and compare AI performance to orthopedic residents. Methods We administered ChatGPT and BARD 757 Orthopedic In-Training Examination (OITE) questions. After excluding image-related questions, the AIs answered 390 multiple choice questions, all categorized within 10 sub-specialties (basic science, trauma, sports medicine, spine, hip and knee, pediatrics, oncology, shoulder and elbow, hand, and foot and ankle) and three taxonomy classes (recall, interpretation, and application of knowledge).
Statistical analysis was performed to analyze the number of questions answered correctly by each AI model, the performance returned by each AI model within the categorized question sub-specialty designation, and the performance of each AI model in comparison to the results returned by orthopedic residents classified by their respective post-graduate year (PGY) level. Results BARD answered more overall questions correctly (58% vs 54%, p<0.001). ChatGPT performed better in sports medicine and basic science and worse in hand surgery, while BARD performed better in basic science (p<0.05). The AIs performed better in recall questions compared to the application of knowledge (p<0.05). Based on previous data, it ranked in the 42nd-96th percentile for post-graduate year ones (PGY1s), 27th-58th for PGY2s, 3rd-29th for PGY3s, 1st-21st for PGY4s, and 1st-17th for PGY5s. Discussion ChatGPT excelled in sports medicine but fell short in hand surgery, while both AIs performed well in the basic science sub-specialty but performed poorly in the application of knowledge-based taxonomy questions. BARD performed better than ChatGPT overall. Although the AI reached the second-year PGY orthopedic resident level, it fell short of passing the American Board of Orthopedic Surgery (ABOS). Its strengths in recall-based inquiries highlight its potential as an orthopedic learning and educational tool.",Lum ZC; Collins DP; Dennison S; Guntupalli L; Choudhary S; Saiz AM; Randall RL 37484787,ChatGPT's Ability to Assess Quality and Readability of Online Medical Information: Evidence From a Cross-Sectional Study.,2023,Cureus,,,,"Introduction Artificial Intelligence (AI) platforms have gained widespread attention for their distinct ability to generate automated responses to various prompts. However, its role in assessing the quality and readability of a provided text remains unclear. 
Thus, the purpose of this study is to evaluate the proficiency of the conversational generative pre-trained transformer (ChatGPT) in utilizing the DISCERN tool to evaluate the quality of online content regarding shock wave therapy for erectile dysfunction. Methods Websites were generated using a Google search of ""shock wave therapy for erectile dysfunction"" with location filters disabled. Readability was analyzed using Readable software (Readable.com, Horsham, United Kingdom). Quality was assessed independently by three reviewers using the DISCERN tool. The same plain text files collected were inputted into ChatGPT to determine whether they produced comparable metrics for readability and quality. Results The study results revealed a notable disparity between ChatGPT's readability assessment and that obtained from a reliable tool, Readable.com (p<0.05). This indicates a lack of alignment between ChatGPT's algorithm and that of established tools, such as Readable.com. Similarly, the DISCERN score generated by ChatGPT differed significantly from the scores generated manually by human evaluators (p<0.05), suggesting that ChatGPT may not be capable of accurately identifying poor-quality information sources regarding shock wave therapy as a treatment for erectile dysfunction. Conclusion ChatGPT's evaluation of the quality and readability of online text regarding shockwave therapy for erectile dysfunction differs from that of human raters and trusted tools. Therefore, ChatGPT's current capabilities were not sufficient for reliably assessing the quality and readability of textual content. Further research is needed to elucidate the role of AI in the objective evaluation of online medical content in other fields. 
Continued development in AI and incorporation of tools such as DISCERN into AI software may enhance the way patients navigate the web in search of high-quality medical content in the future.",Golan R; Ripps SJ; Reddy R; Loloi J; Bernstein AP; Connelly ZM; Golan NS; Ramasamy R 38803743,Accuracy of ChatGPT in Neurolocalization.,2024,Cureus,,,,"Introduction ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States) is an artificial intelligence (AI) chatbot with advanced communication skills and a massive knowledge database. However, its application in medicine, specifically in neurolocalization, necessitates clinical reasoning in addition to deep neuroanatomical knowledge. This article examines ChatGPT's capabilities in neurolocalization. Methods Forty-six text-based neurolocalization case scenarios were presented to ChatGPT-3.5 from November 6th, 2023, to November 16th, 2023. Seven neurosurgeons evaluated ChatGPT's responses to these cases, utilizing a 5-point scoring system recommended by ChatGPT, to score the accuracy of these responses. Results ChatGPT-3.5 achieved an accuracy score of 84.8% in generating ""completely correct"" and ""mostly correct"" responses. ANOVA analysis suggested a consistent scoring approach between different evaluators. The mean length of the case text was 69.8 tokens (SD 20.8). Conclusion While this accuracy score is promising, it is not yet reliable for routine patient care. We recommend keeping interactions with ChatGPT concise, precise, and simple to improve response accuracy. 
As AI continues to evolve, it will yield significant and innovative breakthroughs in medicine.",Dabbas WF; Odeibat YM; Alhazaimeh M; Hiasat MY; Alomari AA; Marji A; Samara QA; Ibrahim B; Al Arabiyat RM; Momani G 38021639,"Is ChatGPT's Knowledge and Interpretative Ability Comparable to First Professional MBBS (Bachelor of Medicine, Bachelor of Surgery) Students of India in Taking a Medical Biochemistry Examination?",2023,Cureus,,,,"Introduction ChatGPT is a large language model (LLM)-based chatbot that uses natural language processing to create humanlike conversational dialogue. It has created a significant impact on the entire global landscape, especially in sectors like finance and banking, e-commerce, education, legal, human resources (HR), and recruitment since its inception. There have been multiple ongoing controversies regarding the seamless integration of ChatGPT with the healthcare system because of its factual accuracy, lack of experience, lack of clarity, expertise, and above all, lack of empathy. Our study seeks to compare ChatGPT's knowledge and interpretative abilities with those of first-year medical students in India in the subject of medical biochemistry. Materials and methods A total of 79 questions (40 multiple choice questions and 39 subjective questions) of medical biochemistry were set for Phase 1, block II term examination. ChatGPT was enrolled as the 101st student in the class. The questions were entered into ChatGPT's interface and responses were noted. The response time for the multiple-choice questions (MCQs) asked was also noted. The answers given by ChatGPT and 100 students of the class were checked by two subject experts, and marks were given according to the quality of answers. Marks obtained by the AI chatbot were compared with the marks obtained by the students. Results ChatGPT scored 140 marks out of 200 and outperformed almost all the students and ranked fifth in the class.
It scored very well in information-based MCQs (92%) and descriptive logical reasoning (80%), whereas it performed poorly in descriptive clinical scenario-based questions (52%). In terms of time taken to respond to the MCQs, it took significantly more time to answer logical reasoning MCQs than simple information-based MCQs (3.10+/-0.882 sec vs. 2.02+/-0.477 sec, p<0.005). Conclusions ChatGPT was able to outperform almost all the students in the subject of medical biochemistry. If the ethical issues are dealt with efficiently, these LLMs have huge potential to be used successfully by students in the teaching and learning of modern medicine.",Ghosh A; Maini Jindal N; Gupta VK; Bansal E; Kaur Bajwa N; Sett A 39156271,Learning the Randleman Criteria in Refractive Surgery: Utilizing ChatGPT-3.5 Versus Internet Search Engine.,2024,Cureus,,,,"Introduction Large language models such as OpenAI's (San Francisco, CA) ChatGPT-3.5 hold immense potential to augment self-directed learning in medicine, but concerns have arisen regarding its accuracy in specialized fields. This study compares ChatGPT-3.5 with an internet search engine in their ability to define the Randleman criteria and its five parameters within a self-directed learning environment. Methods Twenty-three medical students gathered information on the Randleman criteria. Each student was allocated 10 minutes to interact with ChatGPT-3.5, followed by 10 minutes to search the internet independently. Each ChatGPT-3.5 conversation, student summary, and internet reference were subsequently analyzed for accuracy, efficiency, and reliability. Results ChatGPT-3.5 provided the correct definition for 26.1% of students (6/23, 95% CI: 12.3% to 46.8%), while an independent internet search resulted in sources containing the correct definition for 100% of students (23/23, 95% CI: 87.5% to 100%, p = 0.0001).
ChatGPT-3.5 incorrectly identified the Randleman criteria as a corneal ectasia staging system for 17.4% of students (4/23), fabricated a ""Randleman syndrome"" for 4.3% of students (1/23), and gave no definition for 52.2% of students (12/23). When a definition was given (47.8%, 11/23), a median of two of the five correct parameters was provided along with a median of two additional falsified parameters. Conclusion Internet search engine outperformed ChatGPT-3.5 in providing accurate and reliable information on the Randleman criteria. ChatGPT-3.5 gave false information, required excessive prompting, and propagated misunderstandings. Learners should exercise discernment when using ChatGPT-3.5. Future initiatives should evaluate the implementation of prompt engineering and updated large-language models.",Tuttle JJ; Moshirfar M; Garcia J; Altaf AW; Omidvarnia S; Hoopes PC 39822447,A Study of Orthopedic Patient Leaflets and Readability of AI-Generated Text in Foot and Ankle Surgery (SOLE-AI).,2024,Cureus,,,,"Introduction The internet age has broadened the horizons of modern medicine, and the ever-increasing scope of artificial intelligence (AI) has made information about healthcare, common pathologies, and available treatment options much more accessible to the wider population. Patient autonomy relies on clear, accurate, and user-friendly information to give informed consent to an intervention. Our paper aims to outline the quality, readability, and accuracy of readily available information produced by AI relating to common foot and ankle procedures. Materials and methods A retrospective qualitative analysis of procedure-specific information relating to three common foot and ankle orthopedic procedures: ankle arthroscopy, ankle arthrodesis/fusion, and a gastrocnemius lengthening procedure was undertaken. 
Patient information leaflets (PILs) created by The British Orthopaedic Foot and Ankle Society (BOFAS) were compared to ChatGPT responses for readability, quality, and accuracy of information. Four language tools were used to assess readability: the Flesch-Kincaid reading ease (FKRE) score, the Flesch-Kincaid grade level (FKGL), the Gunning fog score (GFS), and the simple measure of gobbledygook (SMOG) index. Quality and accuracy were determined by using the DISCERN tool by five independent assessors. Results PILs produced by AI had significantly lower FKRE scores when compared to BOFAS -40.4 (SD: +/-7.69) compared to 91.9 (SD: +/-2.24) (p 1 week), and according to acuity of complications, is presented herein. CONCLUSIONS: ChatGPT overestimated the urgency of indicated patient dispositions in 44% of cases, concerning for potential unnecessary increase in healthcare resource utilization. Imperfect performance, and the tool's tendency for overinclusion in its responses, risk increasing patient anxiety and straining physician-patient relationships. While artificial intelligence has great potential in triaging postoperative patient concerns, and improving efficiency and resource utilization, ChatGPT's performance, in its current form, demonstrates a need for further refinement before its safe and effective implementation in facial aesthetic surgical practice. LEVEL OF EVIDENCE IV: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Abi-Rafeh J; Hanna S; Bassiri-Tehrani B; Kazan R; Nahai F 38764369,Evaluating ChatGPT's effectiveness and tendencies in Japanese internal medicine.,2024,Journal of evaluation in clinical practice,,,,"INTRODUCTION: ChatGPT, a large-scale language model, is a notable example of AI's potential in health care. 
However, its effectiveness in clinical settings, especially when compared to human physicians, is not fully understood. This study evaluates ChatGPT's capabilities and limitations in answering questions for Japanese internal medicine specialists, aiming to clarify its accuracy and tendencies in both correct and incorrect responses. METHODS: We utilized ChatGPT's answers on four sets of self-training questions for internal medicine specialists in Japan from 2020 to 2023. We ran three trials for each set to evaluate its overall accuracy and performance on nonimage questions. Subsequently, we categorized the questions into two groups: those ChatGPT consistently answered correctly (Confirmed Correct Answer, CCA) and those it consistently answered incorrectly (Confirmed Incorrect Answer, CIA). For these groups, we calculated the average accuracy rates and 95% confidence intervals based on the actual performance of internal medicine physicians on each question and analyzed the statistical significance between the two groups. This process was then similarly applied to the subset of nonimage CCA and CIA questions. RESULTS: ChatGPT's overall accuracy rate was 59.05%, increasing to 65.76% for nonimage questions. 24.87% of the questions had answers that varied between correct and incorrect in the three trials. Despite surpassing the passing threshold for nonimage questions, ChatGPT's accuracy was lower than that of human specialists. There was a significant variance in accuracy between CCA and CIA groups, with ChatGPT mirroring human physician patterns in responding to different question types. CONCLUSION: This study underscores ChatGPT's potential utility and limitations in internal medicine. While effective in some aspects, its dependence on question type and context suggests that it should supplement, not replace, professional medical judgment. 
Further research is needed to integrate Artificial Intelligence tools like ChatGPT more effectively into specialized medical practices.",Kaneda Y; Tayuinosho A; Tomoyose R; Takita M; Hamaki T; Tanimoto T; Ozaki A 40147063,Artificial intelligence versus orthopedic surgeons as an orthopedic consultant in the emergency department.,2025,Injury,,,,"INTRODUCTION: ChatGPT, a widely accessible AI program, has demonstrated potential in various healthcare applications, including emergency department (ED) triage, differential diagnosis, and patient education. However, its potential in providing recommendations to emergency department providers with orthopedic consultations has not been evaluated yet. METHODS: This study compared the performance of four board certified orthopedic surgeons, two attendings and two trauma fellows who take independent call at the same institution and ChatGPT-4 in responding to clinical scenarios commonly encountered in emergency departments. Five common orthopedic ED scenarios were developed (lateral malleolar ankle fractures, distal radius fractures, septic arthritis of the knee, shoulder dislocations, and Achilles tendon ruptures), each with four questions related to diagnosis, management, surgical indication, and patient counseling, totaling 20 questions. Responses were anonymized, coded, and evaluated by independent reviewers including emergency medicine physicians using a five-point Likert scale across five criteria: accuracy, completeness, helpfulness, specificity, and overall quality. RESULTS: When comparing the ratings of AI answers to non-AI responders, the AI answers were shown to be superior in completeness, helpfulness, specificity, and overall quality with no difference in regards to accuracy (p < 0.05). When considering question subtypes including diagnosis, management, treatment, and patient counseling, AI was shown to have superior scores in helpfulness, and specificity in diagnostic questions (p < 0.05).
In addition, AI responses were superior in all the assessed categories when looking at the patient counseling questions (p < 0.05). When considering different clinical scenarios, AI outperformed non-AI groups in completeness in the distal radius fracture scenario. Furthermore, AI outperformed non-AI groups in helpfulness in the lateral malleolus fracture scenario. In the shoulder dislocation scenario, AI responses were more complete, helpful, and had a better overall quality. AI responses were non-inferior in the remaining categories of the different scenarios. CONCLUSION: Artificial intelligence exhibited non-inferior and often superior performance in common orthopedic-ED consultations compared to board certified orthopedic surgeons. While current AI models are limited in their ability to integrate specific images and patient scenarios, our findings suggest AI can provide high quality recommendations for generic orthopedic consultations and, with further development, will likely have an increasing role in the future.",Liu J; Segal K; Daher M; Ozolin J; Binder WD; Bergen M; McDonald CL; Owens BD; Antoci V 38507847,"Comparison of emergency medicine specialist, cardiologist, and chat-GPT in electrocardiography assessment.",2024,The American journal of emergency medicine,,,,"INTRODUCTION: ChatGPT, developed by OpenAI, represents the cutting-edge in its field with its latest model, GPT-4. Extensive research is currently being conducted in various domains, including cardiovascular diseases, using ChatGPT. Nevertheless, there is a lack of studies addressing the proficiency of GPT-4 in diagnosing conditions based on Electrocardiography (ECG) data. The goal of this study is to evaluate the diagnostic accuracy of GPT-4 when provided with ECG data, and to compare its performance with that of emergency medicine specialists and cardiologists.
METHODS: This study has received approval from the Clinical Research Ethics Committee of Hitit University Medical Faculty on August 21, 2023 (decision no: 2023-91). Drawing on cases from the ""150 ECG Cases"" book, a total of 40 ECG cases were crafted into multiple-choice questions (comprising 20 everyday and 20 more challenging ECG questions). The participant pool included 12 emergency medicine specialists and 12 cardiology specialists. GPT-4 was administered the questions in a total of 12 separate sessions. The responses from the cardiology physicians, emergency medicine physicians, and GPT-4 were evaluated separately for each of the three groups. RESULTS: In the everyday ECG questions, GPT-4 demonstrated superior performance compared to both the emergency medicine specialists and the cardiology specialists (p < 0.001, p = 0.001). In the more challenging ECG questions, while Chat-GPT outperformed the emergency medicine specialists (p < 0.001), no significant statistical difference was found between Chat-GPT and the cardiology specialists (p = 0.190). Upon examining the accuracy of the total ECG questions, Chat-GPT was found to be more successful compared to both the Emergency Medicine Specialists and the cardiologists (p < 0.001, p = 0.001). CONCLUSION: Our study has shown that GPT-4 is more successful than emergency medicine specialists in evaluating both everyday and more challenging ECG questions. 
It performed better compared to cardiologists on everyday questions, but its performance aligned closely with that of the cardiologists as the difficulty of the questions increased.",Gunay S; Ozturk A; Ozerol H; Yigit Y; Erenler AK 39096711,"The accuracy of Gemini, GPT-4, and GPT-4o in ECG analysis: A comparison with cardiologists and emergency medicine specialists.",2024,The American journal of emergency medicine,,,,"INTRODUCTION: GPT-4, GPT-4o and Gemini advanced, which are among the well-known large language models (LLMs), have the capability to recognize and interpret visual data. When the literature is examined, there are a very limited number of studies examining the ECG performance of GPT-4. However, there is no study in the literature examining the success of Gemini and GPT-4o in ECG evaluation. The aim of our study is to evaluate the performance of GPT-4, GPT-4o, and Gemini in ECG evaluation, assess their usability in the medical field, and compare their accuracy rates in ECG interpretation with those of cardiologists and emergency medicine specialists. METHODS: The study was conducted from May 14, 2024, to June 3, 2024. The book ""150 ECG Cases"" served as a reference, containing two sections: daily routine ECGs and more challenging ECGs. For this study, two emergency medicine specialists selected 20 ECG cases from each section, totaling 40 cases. In the next stage, the questions were evaluated by emergency medicine specialists and cardiologists. In the subsequent phase, a diagnostic question was entered daily into GPT-4, GPT-4o, and Gemini Advanced on separate chat interfaces. In the final phase, the responses provided by cardiologists, emergency medicine specialists, GPT-4, GPT-4o, and Gemini Advanced were statistically evaluated across three categories: routine daily ECGs, more challenging ECGs, and the total number of ECGs. RESULTS: Cardiologists outperformed GPT-4, GPT-4o, and Gemini Advanced in all three groups. 
Emergency medicine specialists performed better than GPT-4o in routine daily ECG questions and total ECG questions (p = 0.003 and p = 0.042, respectively). When comparing GPT-4o with Gemini Advanced and GPT-4, GPT-4o performed better in total ECG questions (p = 0.027 and p < 0.001, respectively). In routine daily ECG questions, GPT-4o also outperformed Gemini Advanced (p = 0.004). Weak agreement was observed in the responses given by GPT-4 (p < 0.001, Fleiss Kappa = 0.265) and Gemini Advanced (p < 0.001, Fleiss Kappa = 0.347), while moderate agreement was observed in the responses given by GPT-4o (p < 0.001, Fleiss Kappa = 0.514). CONCLUSION: While GPT-4o shows promise, especially in more challenging ECG questions, and may have potential as an assistant for ECG evaluation, its performance in routine and overall assessments still lags behind human specialists. The limited accuracy and consistency of GPT-4 and Gemini suggest that their current use in clinical ECG interpretation is risky.",Gunay S; Ozturk A; Yigit Y 39882522,Evolution of artificial intelligence in healthcare: a 30-year bibliometric study.,2024,Frontiers in medicine,,,,"INTRODUCTION: In recent years, the development of artificial intelligence (AI) technologies, including machine learning, deep learning, and large language models, has significantly supported clinical work. Concurrently, the integration of artificial intelligence with the medical field has garnered increasing attention from medical experts. This study undertakes a dynamic and longitudinal bibliometric analysis of AI publications within the healthcare sector over the past three decades to investigate the current status and trends of the fusion between medicine and artificial intelligence. METHODS: Following a search on the Web of Science, researchers retrieved all reviews and original articles concerning artificial intelligence in healthcare published between January 1993 and December 2023. 
The analysis employed Bibliometrix, Biblioshiny, and Microsoft Excel, incorporating the bibliometrix R package for data mining and analysis, and visualized the observed trends in bibliometrics. RESULTS: A total of 22,950 documents were collected in this study. From 1993 to 2023, there was a discernible upward trajectory in scientific output within bibliometrics. The United States and China emerged as primary contributors to medical artificial intelligence research, with Harvard University leading in publication volume among institutions. Notably, the rapid expansion of emerging topics such as COVID-19 and new drug discovery in recent years is noteworthy. Furthermore, the top five most cited papers in 2023 were all pertinent to the theme of ChatGPT. CONCLUSION: This study reveals a sustained explosive growth trend in AI technologies within the healthcare sector in recent years, with increasingly profound applications in medicine. Additionally, medical artificial intelligence research is dynamically evolving with the advent of new technologies. Moving forward, concerted efforts to bolster international collaboration and enhance comprehension and utilization of AI technologies are imperative for fostering novel innovations in healthcare.",Xie Y; Zhai Y; Lu G 39534227,Large language models in patient education: a scoping review of applications in medicine.,2024,Frontiers in medicine,,,,"INTRODUCTION: Large Language Models (LLMs) are sophisticated algorithms that analyze and generate vast amounts of textual data, mimicking human communication. Notable LLMs include GPT-4o by Open AI, Claude 3.5 Sonnet by Anthropic, and Gemini by Google. This scoping review aims to synthesize the current applications and potential uses of LLMs in patient education and engagement. MATERIALS AND METHODS: Following the PRISMA-ScR checklist and methodologies by Arksey, O'Malley, and Levac, we conducted a scoping review. 
We searched PubMed in June 2024, using keywords and MeSH terms related to LLMs and patient education. Two authors conducted the initial screening, and discrepancies were resolved by consensus. We employed thematic analysis to address our primary research question. RESULTS: The review identified 201 studies, predominantly from the United States (58.2%). Six themes emerged: generating patient education materials, interpreting medical information, providing lifestyle recommendations, supporting customized medication use, offering perioperative care instructions, and optimizing doctor-patient interaction. LLMs were found to provide accurate responses to patient queries, enhance existing educational materials, and translate medical information into patient-friendly language. However, challenges such as readability, accuracy, and potential biases were noted. DISCUSSION: LLMs demonstrate significant potential in patient education and engagement by creating accessible educational materials, interpreting complex medical information, and enhancing communication between patients and healthcare providers. Nonetheless, issues related to the accuracy and readability of LLM-generated content, as well as ethical concerns, require further research and development. Future studies should focus on improving LLMs and ensuring content reliability while addressing ethical considerations.",Aydin S; Karabacak M; Vlachos V; Margetis K 39675178,Use of a large language model (LLM) for ambulance dispatch and triage.,2025,The American journal of emergency medicine,,,,"INTRODUCTION: Large language models (LLMs) have grown in popularity in recent months and have demonstrated advanced clinical reasoning ability. Given the need to prioritize the sickest patients requesting emergency medical services (EMS), we attempted to identify if an LLM could accurately triage ambulance requests using real-world data from a major metropolitan area. 
METHODS: An LLM (ChatGPT 4o Mini, Open AI, San Francisco, CA, USA) with no prior task-specific training was given real ambulance requests from a major metropolitan city in the United States. Requests were batched into groups of four, and the LLM was prompted to identify which of the four patients should be prioritized. The same groupings of four requests were then shown to a panel of experienced critical care paramedics who voted on which patient should be prioritized. RESULTS: Across 98 groupings of four ambulance requests (392 total requests), the LLM agreed with the paramedic panel in most cases (76.5 %, n = 75). In groupings where the paramedic panel was unanimous in their decision (n = 48), the LLM agreed with the unanimous panel in 93.8 % of groupings (n = 45). CONCLUSIONS: Our preliminary analysis indicates LLMs may have the potential to become a useful tool for triage and resource allocation in emergency care settings, especially in cases where there is consensus among subject matter experts. Further research is needed to better understand and clarify how they may best be of service.",Shekhar AC; Kimbrell J; Saharan A; Stebel J; Ashley E; Abbott EE 38606229,Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.,2024,Cureus,,,,"INTRODUCTION: Large language models (LLMs) have transformed various domains in medicine, aiding in complex tasks and clinical decision-making, with OpenAI's GPT-4, GPT-3.5, Google's Bard, and Anthropic's Claude among the most widely used. While GPT-4 has demonstrated superior performance in some studies, comprehensive comparisons among these models remain limited. Recognizing the significance of the National Board of Medical Examiners (NBME) exams in assessing the clinical knowledge of medical students, this study aims to compare the accuracy of popular LLMs on NBME clinical subject exam sample questions. 
METHODS: The questions used in this study were multiple-choice questions obtained from the official NBME website and are publicly available. Questions from the NBME subject exams in medicine, pediatrics, obstetrics and gynecology, clinical neurology, ambulatory care, family medicine, psychiatry, and surgery were used to query each LLM. The responses from GPT-4, GPT-3.5, Claude, and Bard were collected in October 2023. The response by each LLM was compared to the answer provided by the NBME and checked for accuracy. Statistical analysis was performed using one-way analysis of variance (ANOVA). RESULTS: A total of 163 questions were presented to each LLM. GPT-4 scored 163/163 (100%), GPT-3.5 scored 134/163 (82.2%), Bard scored 123/163 (75.5%), and Claude scored 138/163 (84.7%). The total performance of GPT-4 was statistically superior to that of GPT-3.5, Claude, and Bard by 17.8%, 15.3%, and 24.5%, respectively. The total performance of GPT-3.5, Claude, and Bard was not significantly different. GPT-4 significantly outperformed Bard in specific subjects, including medicine, pediatrics, family medicine, and ambulatory care, and GPT-3.5 in ambulatory care and family medicine. Across all LLMs, the surgery exam had the highest average score (18.25/20), while the family medicine exam had the lowest average score (3.75/5). CONCLUSION: GPT-4's superior performance on NBME clinical subject exam sample questions underscores its potential in medical education and practice. While LLMs exhibit promise, discernment in their application is crucial, considering occasional inaccuracies.
As technological advancements continue, regular reassessments and refinements are imperative to maintain their reliability and relevance in medicine.",Abbas A; Rehman MS; Rehman SS 39234726,Zero-Shot LLMs for Named Entity Recognition: Targeting Cardiac Function Indicators in German Clinical Texts.,2024,Studies in health technology and informatics,,,,"INTRODUCTION: Large Language Models (LLMs) like ChatGPT have become increasingly prevalent. In medicine, many potential areas arise where LLMs may offer added value. Our research focuses on the use of open-source LLM alternatives like Llama 3, Gemma, Mistral, and Mixtral to extract medical parameters from German clinical texts. We concentrate on German due to an observed gap in research for non-English tasks. OBJECTIVE: To evaluate the effectiveness of open-source LLMs in extracting medical parameters from German clinical texts, specifically focusing on cardiovascular function indicators from cardiac MRI reports. METHODS: We extracted 14 cardiovascular function indicators, including left and right ventricular ejection fraction (LV-EF and RV-EF), from 497 variously formulated cardiac magnetic resonance imaging (MRI) reports. Our systematic analysis involved assessing the performance of Llama 3, Gemma, Mistral, and Mixtral models in terms of right annotation and named entity recognition (NER) accuracy. RESULTS: The analysis confirms strong performance with up to 95.4% right annotation and 99.8% NER accuracy across different architectures, despite the fact that these models were not explicitly fine-tuned for data extraction or the German language.
CONCLUSION: The results strongly support the use of open-source LLMs for extracting medical parameters from clinical texts, including those in German, due to their high accuracy and effectiveness even without specific fine-tuning.",Plagwitz L; Neuhaus P; Yildirim K; Losch N; Varghese J; Buscher A 37276372,New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology.,2023,Urology practice,,,,"INTRODUCTION: Large language models have demonstrated impressive capabilities, but application to medicine remains unclear. We seek to evaluate the use of ChatGPT on the American Urological Association Self-assessment Study Program as an educational adjunct for urology trainees and practicing physicians. METHODS: One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining items were encoded as open ended or multiple choice. ChatGPT's output was coded as correct, incorrect, or indeterminate; if indeterminate, responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning. RESULTS: ChatGPT was correct on 36/135 (26.7%) open-ended and 38/135 (28.2%) multiple-choice questions. Indeterminate responses were generated in 40 (29.6%) and 4 (3.0%), respectively. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) were on initial output, 8 (22.2%) and 1 (2.6%) on second output, and 4 (11.1%) and 1 (2.6%) on final output, respectively. Although regeneration decreased indeterminate responses, the proportion of correct responses did not increase. For open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained concordant between correct and incorrect answers.
CONCLUSIONS: ChatGPT previously demonstrated promise on medical licensing exams; however, comparable performance on the 2022 Self-assessment Study Program was not demonstrated. Performance was better on multiple-choice than on open-ended questions. More concerning were the persistent justifications for incorrect responses; left unchecked, utilization of ChatGPT in medicine may facilitate medical misinformation.",Huynh LM; Bonebrake BT; Schultis K; Quach A; Deibert CM 39936270,ChatGPT versus physician-derived answers to drug-related questions.,2024,Danish medical journal,,,,"INTRODUCTION: Large language models have recently gained interest within the medical community. Their clinical impact is currently being investigated, with potential application in pharmaceutical counselling, which has yet to be assessed. METHODS: We performed a retrospective investigation of ChatGPT 3.5 and 4.0 in response to 49 consecutive inquiries encountered in the joint pharmaceutical counselling service of the Central and North Denmark regions. Answers were rated by comparing them with the answers generated by physicians. RESULTS: ChatGPT 3.5 and 4.0 provided answers rated better or equal in 39 (80%) and 48 (98%) cases, respectively, compared to the pharmaceutical counselling service. References did not accompany answers from ChatGPT, and ChatGPT did not elaborate on what would be considered most clinically relevant when providing multiple answers. CONCLUSIONS: In drug-related questions, ChatGPT (4.0) provided answers of a reasonably high quality. The lack of references and an occasionally limited clinical interpretation make it less useful as a primary source of information. FUNDING: None.
TRIAL REGISTRATION: Not relevant.",Helgestad OK; Hjelholt AJ; Vestergaard SV; Azuz S; Saedder EA; Overvad TF 38081765,ChatGPT in Iranian medical licensing examination: evaluating the diagnostic accuracy and decision-making capabilities of an AI-based model.,2023,BMJ health & care informatics,,,,"INTRODUCTION: Large language models such as ChatGPT have gained popularity for their ability to generate comprehensive responses to human queries. In the field of medicine, ChatGPT has shown promise in applications ranging from diagnostics to decision-making. However, its performance in medical examinations and its comparison to random guessing have not been extensively studied. METHODS: This study aimed to evaluate the performance of ChatGPT in the preinternship examination, a comprehensive medical assessment for students in Iran. The examination consisted of 200 multiple-choice questions categorised into basic science evaluation, diagnosis and decision-making. GPT-4 was used, and the questions were translated to English. A statistical analysis was conducted to assess the performance of ChatGPT and also compare it with a random test group. RESULTS: The results showed that ChatGPT performed exceptionally well, with 68.5% of the questions answered correctly, significantly surpassing the pass mark of 45%. It exhibited superior performance in decision-making and successfully passed all specialties. Comparing ChatGPT to the random test group, ChatGPT's performance was significantly higher, demonstrating its ability to provide more accurate responses and reasoning. CONCLUSION: This study highlights the potential of ChatGPT in medical licensing examinations and its advantage over random guessing. However, it is important to note that ChatGPT still falls short of human physicians in terms of diagnostic accuracy and decision-making capabilities. 
Caution should be exercised when using ChatGPT, and its results should be verified by human experts to ensure patient safety and avoid potential errors in the medical field.",Ebrahimian M; Behnam B; Ghayebi N; Sobhrakhshankhah E 40365297,Optimizing theranostics chatbots with context-augmented large language models.,2025,Theranostics,,,,"Introduction: Nuclear medicine theranostics is rapidly emerging as an interdisciplinary therapy option with multi-dimensional considerations. Healthcare professionals do not have the time to do in-depth research on every therapy option. Personalized chatbots might help to educate them. Chatbots using Large Language Models (LLMs), such as ChatGPT, are gaining interest for addressing these challenges. However, chatbot performance often falls short in specific domains, which is critical in healthcare applications. Methods: This study develops a framework to examine the use of contextual augmentation to improve the performance of medical theranostic chatbots, creating the first theranostic chatbot. Contextual augmentation involves providing additional relevant information to LLMs to improve their responses. We evaluate five state-of-the-art LLMs on questions translated into English and German. We compare answers generated with and without contextual augmentation, where the LLMs access pre-selected research papers via Retrieval Augmented Generation (RAG). We use two RAG techniques: Naive RAG and Advanced RAG. Results: A user study and LLM-based evaluation assess answer quality across different metrics. Results show that Advanced RAG techniques considerably enhance LLM performance. Among the models, the best-performing variants are CLAUDE 3 OPUS and GPT-4O. These models consistently achieve the highest scores, indicating robust integration and utilization of contextual information. The most notable improvements between Naive RAG and Advanced RAG are observed in the GEMINI 1.5 and COMMAND R+ variants.
Conclusion: This study demonstrates that contextual augmentation addresses the complexities inherent in theranostics. Despite promising results, key limitations include the biased selection of questions, focusing primarily on PRRT, and the need for comprehensive context documents. Future research should include a broader range of theranostics questions, explore additional RAG methods and aim to compare human and LLM evaluations more directly to enhance LLM performance further.",Koller P; Clement C; van Eijk A; Seifert R; Zhang J; Prenosil G; Sathekge MM; Herrmann K; Baum R; Weber WA; Rominger A; Shi K 39430700,Evaluating the comprehension and accuracy of ChatGPT's responses to diabetes-related questions in Urdu compared to English.,2024,Digital health,,,,"INTRODUCTION: Patients with diabetes require healthcare and information that are accurate and extensive. Large language models (LLMs) like ChatGPT herald the capacity to provide such exhaustive data. To determine (a) the comprehensiveness of ChatGPT's responses in Urdu to diabetes-related questions and (b) the accuracy of ChatGPT's Urdu responses when compared to its English responses. METHODS: A cross-sectional observational study was conducted. Two reviewers experienced in internal medicine and endocrinology graded 53 Urdu and English responses on diabetes knowledge, lifestyle, and prevention. A senior reviewer resolved discrepancies. Responses were assessed for comprehension and accuracy, then compared to English. RESULTS: Among the Urdu responses generated, only two of 53 (3.8%) questions were graded as comprehensive, and five of 53 (9.4%) were graded as correct but inadequate. We found that 25 of 53 (47.2%) questions were graded as mixed with correct and incorrect/outdated data, the largest proportion of responses being graded as such.
When the grading scale was used to compare the accuracy of Urdu and English responses, no Urdu response (0.0%) was rated more accurate than its English counterpart. The overwhelming majority of Urdu responses, 49 of 53 (92.5%), were rated less accurate than their English counterparts. CONCLUSION: We found that although the ability to retrieve such information about diabetes is impressive, it can merely be used as an adjunct instead of a solitary source of information. Further work must be done to optimize Urdu responses in medical contexts to approximate the boundless potential it heralds.",Faisal S; Kamran TE; Khalid R; Haider Z; Siddiqui Y; Saeed N; Imran S; Faisal R; Jabeen M 37795422,Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment.,2023,Frontiers in medicine,,,,"INTRODUCTION: Recent developments in artificial intelligence large language models (LLMs), such as ChatGPT, have allowed for the understanding and generation of human-like text. Studies have found that LLMs perform well in various examinations including law, business and medicine. This study aims to evaluate the performance of ChatGPT in the United Kingdom Medical Licensing Assessment (UKMLA). METHODS: Two publicly available UKMLA papers consisting of 200 single-best-answer (SBA) questions were screened. Nine SBAs were omitted as they contained images that were not suitable for input. Each question was assigned a specialty based on the UKMLA content map published by the General Medical Council. A total of 191 SBAs were inputted into ChatGPT-4 through three attempts over the course of 3 weeks (once per week). RESULTS: ChatGPT scored 74.9% (143/191), 78.0% (149/191) and 75.6% (145/191) on three attempts, respectively. The average of all three attempts was 76.3% (437/573) with a 95% confidence interval of (74.46% and 78.08%). ChatGPT answered 129 SBAs correctly and 32 SBAs incorrectly on all three attempts.
Across the three attempts, ChatGPT performed well in mental health (8/9 SBAs), cancer (11/14 SBAs) and cardiovascular (10/13 SBAs). It did not perform well in clinical haematology (3/7 SBAs), endocrine and metabolic (2/5 SBAs) and gastrointestinal including liver (3/10 SBAs). Regarding response consistency, ChatGPT provided consistently correct answers in 67.5% (129/191) of SBAs, consistently incorrect answers in 12.6% (24/191) and inconsistent responses in 19.9% (38/191) of SBAs, respectively. DISCUSSION AND CONCLUSION: This study suggests ChatGPT performs well in the UKMLA. There may be a correlation between specialty and performance. LLMs' ability to correctly answer SBAs suggests that they could be utilised as a supplementary learning tool in medical education with appropriate medical educator supervision.",Lai UH; Wu KS; Hsu TY; Kan JKC 39290564,Zero-shot evaluation of ChatGPT for food named-entity recognition and linking.,2024,Frontiers in nutrition,,,,"INTRODUCTION: Recognizing and extracting key information from textual data plays an important role in intelligent systems by maintaining up-to-date knowledge, reinforcing informed decision-making, question-answering, and more. It is especially apparent in the food domain, where critical information guides the decisions of nutritionists and clinicians. The information extraction process involves two natural language processing tasks: named entity recognition (NER) and named entity linking (NEL). With the emergence of large language models (LLMs), especially ChatGPT, many areas began incorporating its knowledge to reduce workloads or simplify tasks. In the field of food, however, we noticed an opportunity to involve ChatGPT in NER and NEL. METHODS: To assess ChatGPT's capabilities, we have evaluated its two versions, ChatGPT-3.5 and ChatGPT-4, focusing on their performance across both NER and NEL tasks, emphasizing food-related data.
To benchmark our results in the food domain, we also investigated its capabilities in the more broadly investigated biomedical domain. By evaluating its zero-shot capabilities, we were able to ascertain the strengths and weaknesses of the two versions of ChatGPT. RESULTS: ChatGPT showed promising results in NER compared to other models; however, when tasked with linking entities to their identifiers from semantic models, its effectiveness falls drastically. DISCUSSION: While the integration of ChatGPT holds potential across various fields, it is crucial to approach its use with caution, particularly in relying on its responses for critical decisions in food and bio-medicine.",Ogrinc M; Korousic Seljak B; Eftimov T 38331591,[The spring of artificial intelligence: AI vs. expert for internal medicine cases].,2024,La Revue de medecine interne,,,,"INTRODUCTION: The ""Printemps de la Medecine Interne"" are training days for Francophone internists. The clinical cases presented during these days are complex. This study aims to evaluate the diagnostic capabilities of the non-specialized artificial intelligence language models ChatGPT-4 and Bard by confronting them with the puzzles of the ""Printemps de la Medecine Interne"". METHOD: Clinical cases from the ""Printemps de la Medecine Interne"" 2021 and 2022 were submitted to two language models: ChatGPT-4 and Bard. In case of a wrong answer, a second attempt was offered. We then compared the responses of human internist experts to those of artificial intelligence. RESULTS: Of the 12 clinical cases submitted, human internist experts diagnosed nine, ChatGPT-4 diagnosed three, and Bard diagnosed one. One of the cases solved by ChatGPT-4 was not solved by the internist expert. The artificial intelligence had a response time of a few seconds. CONCLUSIONS: Currently, the diagnostic skills of ChatGPT-4 and Bard are inferior to those of human experts in solving complex clinical cases but are very promising.
Recently made available to the general public, they already have impressive capabilities, raising questions about the role of the diagnostic physician. It would be advisable to adapt the rules or subjects of future ""Printemps de la Medecine Interne"" so that they are not solved by a public language model.",Albaladejo A; Lorleac'h A; Allain JS 38985176,[What is the potential of ChatGPT for qualified patient information? : Attempt of a structured analysis on the basis of a survey regarding complementary and alternative medicine (CAM) in rheumatology].,2025,Zeitschrift fur Rheumatologie,,,,"INTRODUCTION: The chatbot ChatGPT represents a milestone in the interaction between humans and large databases that are accessible via the internet. It facilitates the answering of complex questions by enabling communication in everyday language. Therefore, it is a potential source of information for those who are affected by rheumatic diseases. The aim of our investigation was to find out whether ChatGPT (version 3.5) is capable of giving qualified answers regarding the application of specific methods of complementary and alternative medicine (CAM) in three rheumatic diseases: rheumatoid arthritis (RA), systemic lupus erythematosus (SLE) and granulomatosis with polyangiitis (GPA). In addition, we investigated how the chatbot's answers were influenced by the wording of the question. METHODS: The questioning of ChatGPT was performed in three parts. Part A consisted of an open question regarding the best way to treat the respective disease. In part B, the questions were directed towards possible indications for the application of CAM in general in one of the three disorders. In part C, the chatbot was asked for specific recommendations regarding one of three CAM methods: homeopathy, ayurvedic medicine and herbal medicine.
Questions in parts B and C were expressed in two variants: the first asked whether the specific CAM was applicable at all in certain rheumatic diseases; the second asked which procedure of the respective CAM method worked best in the specific disease. The validity of the answers was checked by using the ChatGPT reliability score, a Likert scale ranging from 1 (lowest validity) to 7 (highest validity). RESULTS: The answers to the open questions of part A had the highest validity. In parts B and C, ChatGPT suggested a variety of CAM applications that lacked scientific evidence. The validity of the answers depended on the wording of the questions. If the question suggested an inclination to apply a certain CAM, the answers often omitted the lack of evidence and received lower scores. CONCLUSION: The answers of ChatGPT (version 3.5) regarding the applicability of CAM in selected rheumatic diseases are not convincingly based on scientific evidence. In addition, the wording of the questions affects the validity of the information. Currently, an uncritical application of ChatGPT as an instrument for patient information cannot be recommended.",Keysser G; Pfeil A; Reuss-Borst M; Frohne I; Schultz O; Sander O 38251407,"Benzodiazepine Boom: Tracking Etizolam, Pyrazolam, and Flubromazepam from Pre-UK Psychoactive Act 2016 to Present Using Analytical and Social Listening Techniques.",2024,"Pharmacy (Basel, Switzerland)",,,,"INTRODUCTION: The designer benzodiazepine (DBZD) market continues to expand whilst evading regulatory controls. The widespread adoption of social media by pro-drug use communities encourages positive discussions around DBZD use/misuse, driving demand.
This research addresses the evolution of three popular DBZDs, etizolam (E), flubromazepam (F), and pyrazolam (P), available on the drug market for over a decade, comparing the quantitative chemical analyses of tablet samples, purchased from the internet prior to the implementation of the Psychoactive Substances Act UK 2016, with the thematic netnographic analyses of social media content. METHOD: Drug samples were purchased from the internet in early 2016. The characterisation of all drug batches was performed using UHPLC-MS and supported with ¹H NMR. In addition, netnographic studies across the platforms X (formerly Twitter) and Reddit, between 2016-2023, were conducted. The latter was supported by both manual and artificial intelligence (AI)-driven thematic analyses of social media threads and discussions, using numerous.ai and ChatGPT. RESULTS: UHPLC-MS confirmed the expected drug in every sample, showing remarkable inter/intra batch variability across all batches (E = 13.8 +/- 0.6 to 24.7 +/- 0.9 mg; F = 4.0 +/- 0.2 to 23.5 +/- 0.8 mg; P = 5.2 +/- 0.2 to 11.5 +/- 0.4 mg). ¹H NMR could not confirm etizolam as a lone compound in any etizolam batch. Thematic analyses showed etizolam dominated social media discussions (59% of all posts), with 24.2% of posts involving sale/purchase and 17.8% detailing new administration trends/poly-drug use scenarios. Artificial intelligence confirmed three of the top five trends identified manually. CONCLUSIONS: Purity variability identified across all tested samples emphasises the increased potential health risks associated with DBZD consumption. We propose the global DBZD market is exacerbated by surface web social media discussions, recorded across X and Reddit. Despite the appearance of newer analogues, these three DBZDs remain prevalent and popularised. With recurring themes on harm/effects and new developments in poly-drug use trends, demand for DBZDs continues to grow, despite their potent nature and potential risk to life.
It is proposed that greater controls and constant live monitoring of social media user content are warranted to drive active regulation and targeted, effective harm reduction strategies.",Mullin A; Scott M; Vaccaro G; Floresta G; Arillotta D; Catalani V; Corkery JM; Stair JL; Schifano F; Guirguis A 38575866,Bibliometric analysis of ChatGPT in medicine.,2024,International journal of emergency medicine,,,,"INTRODUCTION: The emergence of artificial intelligence (AI) chat programs has opened two distinct paths, one enhancing interaction and another potentially replacing personal understanding. Ethical and legal concerns arise due to the rapid development of these programs. This paper investigates academic discussions on AI in medicine, analyzing the context, frequency, and reasons behind these conversations. METHODS: The study collected data from the Web of Science database on articles containing the keyword ""ChatGPT"" published from January to September 2023, resulting in 786 medically related journal articles. The inclusion criteria were peer-reviewed articles in English related to medicine. RESULTS: The United States led in publications (38.1%), followed by India (15.5%) and China (7.0%). Keywords such as ""patient"" (16.7%), ""research"" (12%), and ""performance"" (10.6%) were prevalent. The Cureus Journal of Medical Science (11.8%) had the most publications, followed by the Annals of Biomedical Engineering (8.3%). August 2023 had the highest number of publications (29.3%), with significant growth from February to March and from April to May. Medical General Internal (21.0%) was the most common category, followed by Surgery (15.4%) and Radiology (7.9%). DISCUSSION: The prominence of India in ChatGPT research, despite lower research funding, indicates the platform's popularity and highlights the importance of monitoring its use for potential medical misinformation.
China's interest in ChatGPT research suggests a focus on Natural Language Processing (NLP) AI applications, despite public bans on the platform. Cureus' success in publishing ChatGPT articles can be attributed to its open-access, rapid publication model. The study identifies research trends in plastic surgery, radiology, and obstetric gynecology, emphasizing the need for ethical considerations and reliability assessments in the application of ChatGPT in medical practice. CONCLUSION: ChatGPT's presence in medical literature is growing rapidly across various specialties, but concerns related to safety, privacy, and accuracy persist. More research is needed to assess its suitability for patient care and implications for non-medical use. Skepticism and thorough review of research are essential, as current studies may face retraction as more information emerges.",Gande S; Gould M; Ganti L 38354991,Accuracy of Online Artificial Intelligence Models in Primary Care Settings.,2024,American journal of preventive medicine,,,,"INTRODUCTION: The importance of preventive medicine and primary care in the sphere of public health is expanding, yet a gap exists in the utilization of recommended medical services. As patients increasingly turn to online resources for supplementary advice, the role of artificial intelligence (AI) in providing accurate and reliable information has emerged. The present study aimed to assess ChatGPT-4's and Google Bard's capacity to deliver accurate recommendations in preventive medicine and primary care. METHODS: Fifty-six questions were formulated and presented to ChatGPT-4 in June 2023 and Google Bard in October 2023, and the responses were independently reviewed by two physicians, with each answer being classified as ""accurate,"" ""inaccurate,"" or ""accurate with missing information."" Disagreements were resolved by a third physician. 
RESULTS: Initial inter-reviewer agreement on grading was substantial (Cohen's Kappa was 0.76, 95%CI [0.61-0.90] for ChatGPT-4 and 0.89, 95%CI [0.79-0.99] for Bard). After reaching a consensus, 28.6% of ChatGPT-4-generated answers were deemed accurate, 28.6% inaccurate, and 42.8% accurate with missing information. In comparison, 53.6% of Bard-generated answers were deemed accurate, 17.8% inaccurate, and 28.6% accurate with missing information. Responses to CDC and immunization-related questions showed notable inaccuracies (80%) in both models. CONCLUSIONS: ChatGPT-4 and Bard demonstrated potential in offering accurate information in preventive care. It also brought to light the critical need for regular updates, particularly in the rapidly evolving areas of medicine. A significant proportion of the AI models' responses were deemed ""accurate with missing information,"" emphasizing the importance of viewing AI tools as complementary resources when seeking medical information.",Kassab J; Hadi El Hajjar A; Wardrop RM 3rd; Brateanu A 39552949,Performance of ChatGPT in emergency medicine residency exams in Qatar: A comparative analysis with resident physicians.,2024,Qatar medical journal,,,,"INTRODUCTION: The inclusion of artificial intelligence (AI) in the healthcare sector has transformed medical practices by introducing innovative techniques for medical education, diagnosis, and treatment strategies. In medical education, the potential of AI to enhance learning and assessment methods is being increasingly recognized. This study aims to evaluate the performance of OpenAI's Chat Generative Pre-Trained Transformer (ChatGPT) in emergency medicine (EM) residency examinations in Qatar and compare it with the performance of resident physicians. METHODS: A retrospective descriptive study with a mixed-methods design was conducted in August 2023. EM residents' examination scores were collected and compared with the performance of ChatGPT on the same examinations. 
The examinations consisted of multiple-choice questions (MCQs) from the same faculty responsible for Qatari Board EM examinations. ChatGPT's performance on these examinations was analyzed and compared with that of residents across various postgraduate years (PGY). RESULTS: The study included 238 emergency department residents from PGY1 to PGY4 and compared their performance with that of ChatGPT. ChatGPT scored consistently higher than resident groups in all examination categories. However, a notable decline in passing rates was observed among senior residents, indicating a potential misalignment between examination performance and practical competencies. Another likely reason may be the impact of the COVID-19 pandemic on their learning experience, knowledge acquisition, and consolidation. CONCLUSION: ChatGPT demonstrated significant proficiency in the theoretical knowledge of EM, outperforming resident physicians in examination settings. This finding suggests the potential of AI as a supplementary tool in medical education.",Iftikhar H; Anjum S; Bhutta ZA; Najam M; Bashir K 37868075,Teaching AI Ethics in Medical Education: A Scoping Review of Current Literature and Practices.,2023,Perspectives on medical education,,,,"INTRODUCTION: The increasing use of Artificial Intelligence (AI) in medicine has raised ethical concerns, such as patient autonomy, bias, and transparency. Recent studies suggest a need for teaching AI ethics as part of medical curricula. This scoping review aimed to represent and synthesize the literature on teaching AI ethics as part of medical education. METHODS: The PRISMA-SCR guidelines and JBI methodology guided a literature search in four databases (PubMed, Embase, Scopus, and Web of Science) for the past 22 years (2000-2022). To account for the release of AI-based chat applications, such as ChatGPT, the literature search was updated to include publications until the end of June 2023.
RESULTS: 1384 publications were originally identified and, after screening titles and abstracts, the full text of 87 publications was assessed. Following the assessment of the full text, 10 publications were included for further analysis. The updated literature search identified two additional relevant publications from 2023, which were included in the analysis. All 12 publications recommended teaching AI ethics in medical curricula due to the potential implications of AI in medicine. Anticipated ethical challenges such as bias were identified as the recommended basis for teaching content in addition to basic principles of medical ethics. Case-based teaching using real-world examples in interactive seminars and small groups was recommended as a teaching modality. CONCLUSION: This scoping review reveals a scarcity of literature on teaching AI ethics in medical education, with most of the available literature being recent and theoretical. These findings emphasize the importance of more empirical studies and foundational definitions of AI ethics to guide the development of teaching content and modalities. Recognizing AI's significant impact on medicine, additional research on the teaching of AI ethics in medical education is needed to best prepare medical students for future ethical challenges.",Weidener L; Fischer M 39454451,The utility of ChatGPT in gender-affirming mastectomy education.,2024,"Journal of plastic, reconstructive & aesthetic surgery : JPRAS",,,,"INTRODUCTION: The integration of AI such as ChatGPT in medicine has been showing promise in enhancing patient education. Gender-affirming mastectomy (GAM) is a surgical procedure designed to help individuals transition to their self-identified gender, playing a crucial role in mitigating psychological distress for many transmasculine and non-binary (TNB) patients.
With increased demand and attention towards GAM, plastic and reconstructive surgeons may rely on AI-driven chatbots as an accessible, accurate, and patient-driven model for relevant details on this procedure. SPECIFIC AIM(S): This study aimed to assess the quality and readability of information provided by ChatGPT in response to frequently asked questions (FAQs) related to GAM. METHODS: Inspired by online forums and physician websites, 10 FAQs about pre- and postoperative topics were submitted to ChatGPT and assessed using validated readability score measures and expert interpretation. RESULTS: The average readability score was 16.0 +/- 1, indicating a college or graduate reading level. Mean accuracy, comprehensiveness, and danger scores were 8.8 +/- 0.5, 7.8 +/- 0.7, and 2.2 +/- 0.4, respectively. Although physicians appreciated ChatGPT's tone, patient autonomy, and advice to seek professional medical and mental help, they also cited instances of generic information, misinformation, support of debated techniques, and pathologization of gender dysphoria. CONCLUSION: Even with its promise in providing accurate and comprehensive information on GAM, ChatGPT's current limitations suggest caution as a supplementary tool to physician consultation.",Snee I; Lava CX; Li KR; Corral GD 39224724,Assessing the Quality and Reliability of AI-Generated Responses to Common Hypertension Queries.,2024,Cureus,,,,"INTRODUCTION: The integration of artificial intelligence (AI) in healthcare, particularly through language models like ChatGPT and ChatSonic, has gained substantial attention. This article explores the utilization of these AI models to address patient queries related to hypertension, emphasizing their potential to enhance health literacy and disease understanding. 
The study aims to compare the quality and reliability of responses generated by ChatGPT and ChatSonic in addressing common patient queries about hypertension and evaluate these AI models using the Global Quality Scale (GQS) and the Modified DISCERN scale. METHODS: A virtual cross-sectional observational study was conducted over one month, starting in October 2023. Ten common patient queries regarding hypertension were presented to ChatGPT (https://chat.openai.com/) and ChatSonic (https://writesonic.com/chat), and the responses were recorded. Two internal medicine physicians assessed the responses using the GQS and the Modified DISCERN scale. Statistical analysis included Cohen's Kappa values for inter-rater agreement. RESULTS: The study evaluated responses from ChatGPT and ChatSonic for 10 patient queries. Assessors observed variations in the quality and reliability assessments between the two AI models. Cohen's Kappa values indicated minimal agreement between the evaluators for both the GQS and Modified DISCERN scale. CONCLUSIONS: This study highlights the variations in the assessment of responses generated by ChatGPT and ChatSonic for hypertension-related queries. The findings highlight the need for ongoing monitoring and fact-checking of AI-generated responses.",Vinufrancis A; Al Hussein H; Patel HV; Nizami A; Singh A; Nunez B; Abdel-Aal AM 40066104,Introducing AI-generated cases (AI-cases) & standardized clients (AI-SCs) in communication training for veterinary students: perceptions and adoption challenges.,2024,Frontiers in veterinary science,,,,"INTRODUCTION: The integration of Artificial Intelligence (AI) into medical education and healthcare has grown steadily over these past couple of years, though its application in veterinary education and practice remains relatively underexplored. 
This study is among the first to introduce veterinary students to AI-generated cases (AI-cases) and AI-standardized clients (AI-SCs) for teaching and learning communication skills. The study aimed to evaluate students' beliefs and perceptions surrounding the use of AI in veterinary education, with specific focus on communication skills training. METHODS: Conducted at Texas Tech University School of Veterinary Medicine (TTU SVM) during the Spring 2024 semester, the study included pre-clinical veterinary students (n = 237), who participated in a 90-min communication skills laboratory activity. Each class was introduced to two AI-cases and two AI-SCs, developed using OpenAI's ChatGPT-3.5. The Calgary Cambridge Guide (CCG) served as the framework for practicing communication skills. RESULTS: Results showed that although students recognized the widespread use of AI in everyday life, their familiarity, comfort and application of AI in veterinary education were limited. Notably, upper-year students were more hesitant to adopt AI-based tools, particularly in communication skills training. DISCUSSION: The findings suggest that veterinary institutions should prioritize AI-literacy and further explore how AI can enhance and complement communication training, veterinary education and practice.",Artemiou E; Hooper S; Dascanio L; Schmidt M; Gilbert G 40061376,A bibliometric analysis of the advance of artificial intelligence in medicine.,2025,Frontiers in medicine,,,,"INTRODUCTION: The integration of artificial intelligence (AI) into medicine has ushered an era of unprecedented innovation, with substantial impacts on healthcare delivery and patient outcomes. Understanding the current development, primary research focuses, and key contributors in AI applications in medicine through bibliometric analysis is essential. 
METHODS: For this research, we utilized the Web of Science Core Collection as our main database and performed a review of literature covering the period from January 2019 to December 2023. VOSviewer and R-bibliometrix were used to conduct bibliometric analysis and network visualization, including the number of publications, countries, journals, citations, authors, and keywords. RESULTS: A total of 1,811 publications on research for AI in medicine were released across 565 journals by 12,376 authors affiliated with 3,583 institutions from 97 countries. The United States became the foremost producer of scholarly works, significantly impacting the field. Harvard Medical School exhibited the highest publication count among all institutions. The Journal of Medical Internet Research achieved the highest H-index (19), publication count (76), and total citations (1,495). Four keyword clusters were identified, covering AI applications in digital health, COVID-19 and ChatGPT, precision medicine, and public health epidemiology. ""Outcomes"" and ""Risk"" demonstrated a notable upward trend, indicating the utilization of AI in engaging with clinicians and patients to discuss patients' health condition risks, foreshadowing future research focal points. CONCLUSION: Analyzing our bibliometric data allowed us to identify progress, focus areas, and emerging fields in AI for medicine, pointing to potential future research directions. Since 2019, there has been a steady rise in publications related to AI in medicine, indicating its rapid growth. In addition, we reviewed journals and significant publications to pinpoint prominent countries, institutions, and academics. Researchers will gain important insights into the current landscape, collaborative frameworks, and key research topics in the field from this study. 
The findings suggest directions for future research.",Lin M; Lin L; Lin L; Lin Z; Yan X 40034889,Evaluating the Quality and Readability of Generative Artificial Intelligence (AI) Chatbot Responses in the Management of Achilles Tendon Rupture.,2025,Cureus,,,,"INTRODUCTION: The rise of artificial intelligence (AI), including generative chatbots like ChatGPT (OpenAI, San Francisco, CA, USA), has revolutionized many fields, including healthcare. Patients have gained the ability to prompt chatbots to generate purportedly accurate and individualized healthcare content. This study analyzed the readability and quality of answers to Achilles tendon rupture questions from six generative AI chatbots to evaluate and distinguish their potential as patient education resources. METHODS: The six AI models used were ChatGPT 3.5, ChatGPT 4, Gemini 1.0 (previously Bard; Google, Mountain View, CA, USA), Gemini 1.5 Pro, Claude (Anthropic, San Francisco, CA, USA) and Grok (xAI, Palo Alto, CA, USA) without prior prompting. Each was asked 10 common patient questions about Achilles tendon rupture, determined by five orthopaedic surgeons. The readability of generative responses was measured using Flesch-Kincaid Reading Grade Level, Gunning Fog, and SMOG (Simple Measure of Gobbledygook). The response quality was subsequently graded using the DISCERN criteria by five blinded orthopaedic surgeons. RESULTS: Gemini 1.0 generated statistically significant differences in ease of readability (closest to average American reading level) than responses from ChatGPT 3.5, ChatGPT 4, and Claude. Additionally, mean DISCERN scores demonstrated significantly higher quality of responses from Gemini 1.0 (63.0+/-5.1) and ChatGPT 4 (63.8+/-6.2) than ChatGPT 3.5 (53.8+/-3.8), Claude (55.0+/-3.8), and Grok (54.2+/-4.8). However, the overall quality (question 16, DISCERN) of each model was averaged and graded at an above-average level (range, 3.4-4.4). 
DISCUSSION AND CONCLUSION: Our results indicate that generative chatbots can potentially serve as patient education resources alongside physicians. Although some models lacked sufficient content, each performed above average in overall quality. With the lowest readability and highest DISCERN scores, Gemini 1.0 outperformed ChatGPT, Claude, and Grok and potentially emerged as the simplest and most reliable generative chatbot regarding management of Achilles tendon rupture.",Collins CE; Giammanco PA; Guirgus M; Kricfalusi M; Rice RC; Nayak R; Ruckle D; Filler R; Elsissy JG 40233367,Comparison of ChatGPT's Diagnostic and Management Accuracy of Foot and Ankle Bone-Related Pathologies to Orthopaedic Surgeons.,2025,The Journal of the American Academy of Orthopaedic Surgeons,,,,"INTRODUCTION: The steep rise in utilization of large language model chatbots, such as ChatGPT, has spilled into medicine in recent years. The newest version of ChatGPT, ChatGPT-4, has passed medical licensure examinations and, specifically in orthopaedics, has performed at the level of a postgraduate level three orthopaedic surgery resident on the Orthopaedic In-Service Training Examination question bank sets. The purpose of this study was to evaluate ChatGPT-4's diagnostic and decision-making capacity in the clinical management of bone-related injuries of the foot and ankle. METHODS: Eight bone-related foot and ankle orthopaedic cases were presented to ChatGPT-4 and subsequently evaluated by three fellowship-trained foot and ankle orthopaedic surgeons. Cases were scored using criteria on a Likert scale, graded from a total score of 5 (lowest) to 25 (highest) across five criteria. ChatGPT-4 was referred to as ""Dr. GPT,"" establishing a peer dynamic so that the role of an orthopaedic surgeon was emulated by the chatbot. RESULTS: The average score across all criteria for each case was 4.53 of 5, noting an overall average sum score of 22.7 of 25 for all cases. 
The pathology with the highest score was the second metatarsal stress fracture (24.3), whereas the case with the lowest score was hallux rigidus (21.3). Kendall correlation analysis of interrater reliability showed variable correlation among surgeons, without statistical significance. CONCLUSION: ChatGPT-4 effectively diagnosed and provided appropriate treatment options for simple bone-related foot and ankle cases. Importantly, ChatGPT did not fabricate treatment options (ie, hallucination phenomenon), which has been previously well-documented in the literature, notably receiving its second-highest overall average score in this criterion. ChatGPT struggled to provide comprehensive information beyond standard treatment options. Overall, ChatGPT has the potential to serve as a widely accessible resource for patients and nonorthopaedic clinicians, although limitations may exist in the delivery of comprehensive information.",Essis MD; Hartman H; Tung WS; Oh I; Peden S; Gianakos AL 39851791,"ChatGPT, Google, or PINK? Who Provides the Most Reliable Information on Side Effects of Systemic Therapy for Early Breast Cancer?",2024,Clinics and practice,,,,"Introduction: The survival in early breast cancer (BC) has been significantly improved thanks to numerous new drugs. Nevertheless, the information about the need for systemic therapy, especially chemotherapy, represents an additional stress factor for patients. A common coping strategy is searching for further information, traditionally via search engines or websites, but artificial intelligence (AI) is also increasingly being used. Who provides the most reliable information is now unclear. 
Material and Methods: AI in the form of ChatGPT 3.5 and 4.0, Google, and the website of PINK, a provider of a prescription-based mobile health app for patients with BC, were compared to determine the validity of the statements on the five most common side effects of nineteen approved drugs and one drug with pending approval (Ribociclib) for the systemic treatment of BC. For this purpose, the drugs were divided into three groups: chemotherapy, targeted therapy, and endocrine therapy. The reference for the comparison was the prescribing information of the respective drug. A congruence score was calculated for the information on side effects: correct information (2 points), generally appropriate information (1 point), and otherwise no point. The information sources were then compared using a Friedman test and a Bonferroni-corrected post-hoc test. Results: In the overall comparison, ChatGPT 3.5 received the best score with a congruence of 67.5%, followed by ChatGPT 4.0 with 67.0%, PINK with 59.5%, and Google with 40.0% (p < 0.001). There were also significant differences when comparing the individual subcategories, with the best congruence achieved by PINK (73.3%, p = 0.059) in the chemotherapy category, ChatGPT 4.0 (77.5%; p < 0.001) in the targeted therapy category, and ChatGPT 3.5 (p = 0.002) in the endocrine therapy category. Conclusions: Artificial intelligence and professional online information websites provide the most reliable information on the possible side effects of the systemic treatment of early breast cancer, but congruence with prescribing information is limited. 
The medical consultation should still be considered the best source of information.",Lukac S; Griewing S; Leinert E; Dayan D; Heitmeir B; Wallwiener M; Janni W; Fink V; Ebner F 38728938,"Comparative analysis of ChatGPT, Gemini and emergency medicine specialist in ESI triage assessment.",2024,The American journal of emergency medicine,,,,"INTRODUCTION: The term Artificial Intelligence (AI) was first coined in the 1960s and has made significant progress up to the present day. During this period, numerous AI applications have been developed. GPT-4 and Gemini are two of the best-known of these AI models. As a triage system The Emergency Severity Index (ESI) is currently one of the most commonly used for effective patient triage in the emergency department. The aim of this study is to evaluate the performance of GPT-4, Gemini, and emergency medicine specialists in ESI triage against each other; furthermore, it aims to contribute to the literature on the usability of these AI programs in emergency department triage. METHODS: Our study was conducted between February 1, 2024, and February 29, 2024, among emergency medicine specialists in Turkey, as well as with GPT-4 and Gemini. Ten emergency medicine specialists were included in our study but as a limitation the emergency medicine specialists participating in the study do not frequently use the ESI triage model in daily practice. In the first phase of our study, 100 case examples related to adult or trauma patients were extracted from the sample and training cases found in the ESI Implementation Handbook. In the second phase of our study, the provided responses were categorized into three groups: correct triage, over-triage, and under-triage. In the third phase of our study, the questions were categorized according to the correct triage responses. 
RESULTS: In the results of our study, a statistically significant difference was found between the three groups in terms of correct triage, over-triage, and under-triage (p < 0.001). GPT-4 was found to have the highest correct triage rate with an average of 70.60 (+/-3.74), while Gemini had the highest over-triage rate with an average of 35.2 (+/-2.93) (p < 0.001). The highest under-triage rate was observed in emergency medicine specialists (32.90 (+/-11.83)). In the ESI 1-2 class, Gemini had a correct triage rate of 87.77%, GPT-4 had 85.11%, and emergency medicine specialists had 49.33%. CONCLUSION: In conclusion, our study shows that both GPT-4 and Gemini can accurately triage critical and urgent patients in ESI 1&2 groups at a high rate. Furthermore, GPT-4 has been more successful in ESI triage for all patients. These results suggest that GPT-4 and Gemini could assist in accurate ESI triage of patients in emergency departments.",Meral G; Ates S; Gunay S; Ozturk A; Kusdogan M 38973528,Evaluation of online chat-based artificial intelligence responses about inflammatory bowel disease and diet.,2024,European journal of gastroenterology & hepatology,,,,"INTRODUCTION: The USA has the highest age-standardized prevalence of inflammatory bowel disease (IBD). Both genetic and environmental factors have been implicated in IBD flares and multiple strategies are centered around avoiding dietary triggers to maintain remission. Chat-based artificial intelligence (CB-AI) has shown great potential in enhancing patient education in medicine. We evaluate the role of CB-AI in patient education on dietary management of IBD. METHODS: Six questions evaluating important concepts about the dietary management of IBD were each posed to three CB-AI models - ChatGPT, BingChat, and YouChat - three different times. All responses were graded for appropriateness and reliability by two physicians using dietary information from the Crohn's and Colitis Foundation. 
The responses were graded as reliably appropriate, reliably inappropriate, and unreliable. The expert assessment of the reviewing physicians was validated by the joint probability of agreement for two raters. RESULTS: ChatGPT provided reliably appropriate responses to questions on dietary management of IBD more often than BingChat and YouChat. There were two questions that more than one CB-AI provided unreliable responses to. Each CB-AI provided examples within their responses, but the examples were not always appropriate. Whether the response was appropriate or not, CB-AIs mentioned consulting with an expert in the field. The inter-rater reliability was 88.9%. DISCUSSION: CB-AIs have the potential to improve patient education and outcomes but studies evaluating their appropriateness for various health conditions are sparse. Our study showed that CB-AIs have the ability to provide appropriate answers to most questions regarding the dietary management of IBD.",Naqvi HA; Delungahawatta T; Atarere JO; Bandaru SK; Barrow JB; Mattar MC 39810943,A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis.,2025,AJOG global reports,,,,"INTRODUCTION: The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using these large language model (LLM) chatbots as a modality for education. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients and to determine if there is variability between them. OBJECTIVE: This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and determine the level of variability between them. 
STUDY DESIGN: Three LLMs, including Chat GPT-4 (Open AI), Claude (Anthropic), and Bard (Google) were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared to current guidelines and expert opinion on endometriosis and rated on a scale by nine gynecologists. The grading scale included the following: (1) Completely incorrect, (2) mostly incorrect and some correct, (3) mostly correct and some incorrect, (4) correct but inadequate, (5) correct and comprehensive. Final scores were averaged between the nine reviewers. Kendall's W and the related chi-square test were used to evaluate the reviewers' strength of agreement in ranking the LLMs' responses for each item. RESULTS: Average scores for the 10 answers amongst Bard, Chat GPT, and Claude were 3.69, 4.24, and 3.7, respectively. Two questions showed significant disagreement between the nine reviewers. There were no questions the models could answer comprehensively or correctly across the reviewers. The model most associated with comprehensive and correct responses was ChatGPT. Chatbots showed an improved ability to accurately answer questions about symptoms and pathophysiology over treatment and risk of recurrence. CONCLUSION: The analysis of LLMs revealed that, on average, they mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, it is crucial to maintain a thorough ongoing evaluation process for outputs to provide the most comprehensive and accurate information to patients. 
Further research into this technology and its role in patient education and treatment is crucial as generative AI becomes more embedded in the medical field.",Cohen ND; Ho M; McIntire D; Smith K; Kho KA 39106968,Fact Check: Assessing the Response of ChatGPT to Alzheimer's Disease Myths.,2024,Journal of the American Medical Directors Association,,,,"INTRODUCTION: There are many myths regarding Alzheimer's disease (AD) that have been circulated on the internet, each exhibiting varying degrees of accuracy, inaccuracy, and misinformation. Large language models, such as ChatGPT, may be a valuable tool to help assess these myths for veracity and inaccuracy; however, they can induce misinformation as well. OBJECTIVE: This study assesses ChatGPT's ability to identify and address AD myths with reliable information. METHODS: We conducted a cross-sectional study of attending geriatric medicine clinicians' evaluation of ChatGPT (GPT 4.0) responses to 16 selected AD myths. We prompted ChatGPT to express its opinion on each myth and implemented a survey using REDCap to determine the degree to which clinicians agreed with the accuracy of each of ChatGPT's explanations. We also collected their explanations of any disagreements with ChatGPT's responses. We used a 5-category Likert-type scale with a score ranging from -2 to 2 to quantify clinicians' agreement in each aspect of the evaluation. RESULTS: The clinicians (n = 10) were generally satisfied with ChatGPT's explanations. Among the 16 myths, the clinicians were generally satisfied with these explanations, with [mean (SD) score of 1.1(+/-0.3)]. Most clinicians selected ""Agree"" or ""Strongly Agree"" for each statement. Some statements obtained a small number of ""Disagree"" responses. There were no ""Strongly Disagree"" responses. 
CONCLUSION: Most surveyed health care professionals acknowledged the potential value of ChatGPT in mitigating AD misinformation; however, the need for more refined and detailed explanations of the disease's mechanisms and treatments was highlighted.",Huang SS; Song Q; Beiting KJ; Duggan MC; Hines K; Murff H; Leung V; Powers J; Harvey TS; Malin B; Yin Z 37328321,ChatGPT and large language model (LLM) chatbots: The current state of acceptability and a proposal for guidelines on utilization in academic medicine.,2023,Journal of pediatric urology,,,,"INTRODUCTION: There is currently no clear consensus on the standards for using large language models such as ChatGPT in academic medicine. Hence, we performed a scoping review of available literature to understand the current state of LLM use in medicine and to provide a guideline for future utilization in academia. MATERIALS AND METHODS: A scoping review of the literature was performed through a Medline search on February 16, 2023 using a combination of keywords including artificial intelligence, machine learning, natural language processing, generative pre-trained transformer, ChatGPT, and large language model. There were no restrictions to language or date of publication. Records not pertaining to LLMs were excluded. Records pertaining to LLM ChatBots and ChatGPT were identified and evaluated separately. Among the records pertaining to LLM ChatBots and ChatGPT, those that suggest recommendations for ChatGPT use in academia were utilized to create guideline statements for ChatGPT and LLM use in academic medicine. RESULTS: A total of 87 records were identified. 30 records were not pertaining to large language models and were excluded. 54 records underwent a full-text review for evaluation. There were 33 records related to LLM ChatBots or ChatGPT. 
DISCUSSION: From assessing these texts, five guideline statements for LLM use were developed: (1) ChatGPT/LLM cannot be cited as an author in scientific manuscripts; (2) If ChatGPT/LLM is considered for use in academic work, author(s) should have at least a basic understanding of what ChatGPT/LLM is; (3) Do not use ChatGPT/LLM to produce the entirety of text in manuscripts; humans must be held accountable for use of ChatGPT/LLM, and contents created by ChatGPT/LLM should be meticulously verified by humans; (4) ChatGPT/LLMs may be used for editing and refining of text; (5) Any use of ChatGPT/LLM should be transparent and should be clearly outlined and acknowledged in scientific manuscripts. CONCLUSION: Future authors should remain mindful of the potential impact their academic work may have on healthcare and continue to uphold the highest ethical standards and integrity when utilizing ChatGPT/LLM.",Kim JK; Chua M; Rickard M; Lorenzo A 40358604,Comparing Diagnostic Accuracy of ChatGPT to Clinical Diagnosis in General Surgery Consults: A Quantitative Analysis of Disease Diagnosis.,2025,Military medicine,,,,"INTRODUCTION: This study addressed the challenge of providing accurate and timely medical diagnostics in military health care settings with limited access to advanced diagnostic tools, such as those encountered in austere environments, remote locations, or during large-scale combat operations. The primary objective was to evaluate the utility of ChatGPT, an artificial intelligence (AI) language model, as a support tool for health care providers in clinical decision-making and early diagnosis. MATERIALS AND METHODS: The research used an observational cross-sectional cohort design and exploratory predictive techniques. The methodology involved collecting and analyzing data from clinical scenarios based on common general surgery diagnoses-acute appendicitis, acute cholecystitis, and diverticulitis. 
These scenarios incorporated age, gender, symptoms, vital signs, physical exam findings, laboratory values, medical and surgical histories, and current medication regimens as data inputs. All collected data were entered into a table for each diagnosis. These tables were then used for scenario creation, with scenarios written to reflect typical patient presentations for each diagnosis. Finally, each scenario was entered into ChatGPT (version 3.5) individually, with ChatGPT then being asked to provide the leading diagnosis for the condition based on the provided information. The output from ChatGPT was then compared to the expected diagnosis to assess the accuracy. RESULTS: A statistically significant difference between ChatGPT's diagnostic outcomes and clinical diagnoses for acute cholecystitis and diverticulitis was observed, with ChatGPT demonstrating inferior accuracy in controlled test scenarios. A secondary outcome analysis looked at the relationship between specific symptoms and diagnosis. The presence of these symptoms in incorrect diagnoses indicates that they may adversely impact ChatGPT's diagnostic decision-making, resulting in a higher likelihood of misdiagnosis. These results highlight AI's potential as a diagnostic support tool but underscore the importance of continued research to evaluate its performance in more complex and varied clinical scenarios. CONCLUSIONS: In summary, this study evaluated the diagnostic accuracy of ChatGPT in identifying three common surgical conditions (acute appendicitis, acute cholecystitis, and diverticulitis) using comprehensive patient data, including age, gender, medical history, medications, symptoms, vital signs, physical exam findings, and basic laboratory results. The hypothesis was that ChatGPT might display slightly lower accuracy rates than clinical diagnoses made by medical providers. 
The statistical analysis, which included Fisher's exact test, revealed a significant difference between ChatGPT's diagnostic outcomes and clinical diagnoses, particularly in acute cholecystitis and diverticulitis cases. Therefore, we reject the null hypothesis, as the results indicated that ChatGPT's diagnostic accuracy significantly differs from clinical diagnostics in the presented scenarios. However, ChatGPT's overall high accuracy suggests that it can reliably support clinicians, especially in environments where diagnostic resources are limited, and can serve as a valuable tool in military medicine.",Meier H; McMahon R; Hout B; Randles J; Aden J; Rizzo JA 38555637,Evaluating the Efficacy of AI Chatbots as Tutors in Urology: A Comparative Analysis of Responses to the 2022 In-Service Assessment of the European Board of Urology.,2024,Urologia internationalis,,,,"INTRODUCTION: This study assessed the potential of large language models (LLMs) as educational tools by evaluating their accuracy in answering questions across urological subtopics. METHODS: Three LLMs (ChatGPT-3.5, ChatGPT-4, and Bing AI) were examined in two testing rounds, separated by 48 h, using 100 Multiple-Choice Questions (MCQs) from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA), covering five different subtopics. The correct answer was defined as ""formal accuracy"" (FA) representing the designated single best answer (SBA) among four options. Alternative answers selected from LLMs, which may not necessarily be the SBA but are still deemed correct, were labeled as ""extended accuracy"" (EA). Their capacity to enhance the overall accuracy rate when combined with FA was examined. RESULTS: In two rounds of testing, the FA scores were achieved as follows: ChatGPT-3.5: 58% and 62%, ChatGPT-4: 63% and 77%, and BING AI: 81% and 73%. The incorporation of EA did not yield a significant enhancement in overall performance. 
The achieved gains for ChatGPT-3.5, ChatGPT-4, and BING AI were 7% and 5%, 5% and 2%, and 3% and 1%, respectively (p > 0.3). Within urological subtopics, LLMs showcased the best performance in Pediatrics/Congenital and comparatively less effectiveness in Functional/BPS/Incontinence. CONCLUSION: LLMs exhibit suboptimal urology knowledge and unsatisfactory proficiency for educational purposes. The overall accuracy did not significantly improve when combining EA with FA. The error rates remained high, ranging from 16% to 35%. Proficiency levels vary substantially across subtopics. Further development of medicine-specific LLMs is required before integration into urological training programs.",May M; Korner-Riffard K; Kollitsch L; Burger M; Brookman-May SD; Rauchenwald M; Marszalek M; Eredics K 39958944,Readability of Hospital Online Patient Education Materials Across Otolaryngology Specialties.,2025,Laryngoscope investigative otolaryngology,,,,"INTRODUCTION: This study evaluates the readability of online patient education materials (OPEMs) across otolaryngology subspecialties, hospital characteristics, and national otolaryngology organizations, while assessing AI alternatives. METHODS: Hospitals from the US News Best ENT list were queried for OPEMs describing a chosen surgery per subspecialty; the American Academy of Otolaryngology-Head and Neck Surgery (AAO), American Laryngological Association (ALA), Ear, Nose, and Throat United Kingdom (ENTUK), and the Canadian Society of Otolaryngology-Head and Neck Surgery (CSOHNS) were similarly queried. Google was queried for the top 10 links from hospitals per procedure. Ownership (private/public), presence of respective otolaryngology fellowships, region, and median household income (zip code) were collected. 
Readability was assessed using seven indices and averaged: Automated Readability Index (ARI), Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Readability (GFR), Simple Measure of Gobbledygook (SMOG), Coleman-Liau Readability Index (CLRI), and Linsear Write Readability Formula (LWRF). AI-generated materials from ChatGPT were compared for readability, accuracy, content, and tone. Analyses were conducted between subspecialties, against national organizations and the NIH standard, and across demographic variables. RESULTS: Across 144 hospitals, OPEMs exceeded NIH readability standards, averaging an 8th- to 12th-grade level across subspecialties. In rhinology, facial plastics, and sleep medicine, hospital OPEMs had higher readability scores than ENTUK's materials (11.4 vs. 9.1, 10.4 vs. 7.2, 11.5 vs. 9.2, respectively; all p < 0.05), but lower than AAO (p = 0.005). ChatGPT-generated materials averaged a 6.8-grade level, demonstrating improved readability, especially with specialized prompting, compared to all hospital and organization OPEMs. CONCLUSION: OPEMs from all sources exceed the NIH readability standard. ENTUK serves as a benchmark for accessible language, while ChatGPT demonstrates the feasibility of producing more readable content. Otolaryngologists might consider using ChatGPT, with caution, to generate patient-friendly materials, and advocate for national-level improvements in patient education readability.",Warrier A; Singh RP; Haleem A; Lee A; Mothy D; Patel A; Eloy JA; Manzi B 39487846,"Transforming emergency triage: A preliminary, scenario-based cross-sectional study comparing artificial intelligence models and clinical expertise for enhanced accuracy.",2024,Bratislavske lekarske listy,,,,"INTRODUCTION: This study examines triage judgments in emergency settings and compares the outcomes of artificial intelligence models with those of healthcare professionals.
It examines the disparities in precision rates between subjective evaluations by health professionals and objective assessments by AI systems. MATERIAL AND METHOD: To analyze the efficacy of emergency triage, 50 virtual patient scenarios were created. Emergency medicine residents and other healthcare providers with triage education were tasked with categorizing triage levels for the virtual patient scenarios, and artificial intelligence systems were tasked with resolving the same scenarios. All were asked to use the three-color-coded triage system of the Republic of Turkey Ministry of Health. The answer keys were created by consensus of the researchers. In addition, emergency medicine specialists were asked to evaluate the acuity level of each scenario in order to perform sub-analyses. RESULTS: The study consisted of 86 healthcare professionals, comprising 31 emergency medicine residents (26.5%), 1 paramedic (0.9%), 5 emergency health technicians (4.3%), and 80 nurses (68.4%). Google Bard AI and OpenAI ChatGPT v.3.5 were used as the artificial intelligence systems. The responses were compared with the answer key to determine each group's efficacy. As planned, the responses from healthcare professionals were analyzed individually by scenario acuity level. Emergency medicine residents and the other groups of healthcare providers had significantly higher numbers of correct answers than Google Bard and ChatGPT (n=30.7 vs n=25.5). There was no significant difference between ChatGPT and Bard for low- and high-acuity scenarios (p=0.821). CONCLUSION: AI models can examine extensive data sets and make more accurate and quicker triage judgments with sophisticated algorithms. However, in this study, we found that the triage ability of artificial intelligence is not as proficient as that of humans. A more efficient triage system can be developed by integrating artificial intelligence with human input, rather than relying solely on technology (Tab. 4, Ref. 41).
Text in PDF www.elis.sk Keywords: emergency triage, AI applications, health technology, artificial intelligence, emergency management.",Eraybar S; Dal E; Aydin MO; Begenen M 39025818,"Effects of interacting with a large language model compared with a human coach on the clinical diagnostic process and outcomes among fourth-year medical students: study protocol for a prospective, randomised experiment using patient vignettes.",2024,BMJ open,,,,"INTRODUCTION: Versatile large language models (LLMs) have the potential to augment diagnostic decision-making by assisting diagnosticians, thanks to their ability to engage in open-ended, natural conversations and their comprehensive knowledge access. Yet the novelty of LLMs in diagnostic decision-making introduces uncertainties regarding their impact. Clinicians unfamiliar with the use of LLMs in their professional context may rely on general attitudes towards LLMs more broadly, potentially hindering thoughtful use and critical evaluation of their input, leading to either over-reliance and lack of critical thinking or an unwillingness to use LLMs as diagnostic aids. To address these concerns, this study examines the influence on the diagnostic process and outcomes of interacting with an LLM compared with a human coach, and of prior training vs no training for interacting with either of these 'coaches'. Our findings aim to illuminate the potential benefits and risks of employing artificial intelligence (AI) in diagnostic decision-making. METHODS AND ANALYSIS: We are conducting a prospective, randomised experiment with N=158 fourth-year medical students from Charite Medical School, Berlin, Germany. Participants are asked to diagnose patient vignettes after being assigned to either a human coach or ChatGPT and after either training or no training (both between-subject factors). 
We are specifically collecting data on the effects of using either of these 'coaches' and of additional training on information search, number of hypotheses entertained, diagnostic accuracy and confidence. Statistical methods will include linear mixed effects models. Exploratory analyses of the interaction patterns and attitudes towards AI will also generate more generalisable knowledge about the role of AI in medicine. ETHICS AND DISSEMINATION: The Bern Cantonal Ethics Committee considered the study exempt from full ethical review (BASEC No: Req-2023-01396). All methods will be conducted in accordance with relevant guidelines and regulations. Participation is voluntary and informed consent will be obtained. Results will be published in peer-reviewed scientific medical journals. Authorship will be determined according to the International Committee of Medical Journal Editors guidelines.",Kammer JE; Hautz WE; Krummrey G; Sauter TC; Penders D; Birrenbach T; Bienefeld N 38665043,Evaluating ChatGPT's Utility in Medicine Guidelines Through Web Search Analysis.,2024,The Permanente journal,,,,"INTRODUCTION: With the rise of machine learning applications in health care, shifts in medical fields that rely on precise prognostic models and pattern detection tools are anticipated in the near future. Chat Generative Pretrained Transformer (ChatGPT) is a recent machine learning innovation known for producing text that mimics human conversation. To gauge ChatGPT's capability in addressing patient inquiries, the authors set out to juxtapose it with Google Search, America's predominant search engine. Their comparison focused on: 1) the top questions related to clinical practice guidelines from the American Academy of Family Physicians by category and subject; 2) responses to these prevalent questions; and 3) the top questions that elicited a numerical reply. 
METHODS: Utilizing a freshly installed Google Chrome browser (version 109.0.5414.119), the authors conducted a Google web search (www.google.com) on March 4, 2023, ensuring minimal influence from personalized search algorithms. Search phrases were derived from the clinical guidelines of the American Academy of Family Physicians. The authors prompted ChatGPT with: ""Search Google using the term '(refer to search terms)' and document the top four questions linked to the term."" The same 25 search terms were employed. The authors cataloged the primary 4 questions and their answers for each term, resulting in 100 questions and answers. RESULTS: Of the 100 questions, 42% (42 questions) were consistent across all search terms. ChatGPT predominantly sourced from academic (38% vs 15%, p = 0.0002) and government (50% vs 39%, p = 0.12) domains, whereas Google web searches leaned toward commercial sources (32% vs 11%, p = 0.0002). Thirty-nine percent (39 questions) of the questions yielded divergent answers between the 2 platforms. Notably, 16 of the 39 distinct answers from ChatGPT lacked a numerical reply, instead advising a consultation with a medical professional for health guidance. CONCLUSION: Google Search and ChatGPT present varied questions and answers for both broad and specific queries. Both patients and doctors should exercise prudence when considering ChatGPT as a digital health adviser. It's essential for medical professionals to assist patients in accurately communicating their online discoveries and ensuing inquiries for a comprehensive discussion.",Dubin JA; Bains SS; Hameed D; Chen Z; Gaertner E; Nace J; Mont MA; Delanois RE 40289627,Evaluating Large Language Models on Aerospace Medicine Principles.,2025,Wilderness & environmental medicine,,,,"Introduction: Large language models (LLMs) hold immense potential to serve as clinical decision-support tools for Earth-independent medical operations.
However, the generation of incorrect information may be misleading or even harmful when applied to care in this setting. Method: To better understand this risk, this work tested two publicly available LLMs, ChatGPT-4 and Google Gemini Advanced (1.0 Ultra), as well as a custom Retrieval-Augmented Generation (RAG) LLM, on factual knowledge and clinical reasoning in accordance with published material in aerospace medicine. We also evaluated the consistency of the two public LLMs when answering self-generated board-style questions. Results: When queried with 857 free-response questions from Aerospace Medicine Boards Questions and Answers, ChatGPT-4 had a mean reader score from 4.23 to 5.00 (Likert scale of 1-5) across chapters, whereas Gemini Advanced and the RAG LLM scored 3.30 to 4.91 and 4.69 to 5.00, respectively. When queried with 20 multiple-choice aerospace medicine board questions provided by the American College of Preventive Medicine, ChatGPT-4 and Gemini Advanced responded correctly 70% and 55% of the time, respectively, while the RAG LLM answered 85% correctly. Despite this quantitative measure of high performance, the LLMs tested still exhibited gaps in factual knowledge that could potentially be harmful, a degree of clinical reasoning that may not pass the aerospace medicine board exam, and some inconsistency when answering self-generated questions. Conclusion: There is considerable promise for LLM use in autonomous medical operations in spaceflight given the anticipated continued rapid pace of development, including advancements in model training, data quality, and fine-tuning methods.",Anderson KD; Davis CA; Pickett SM; Pohlen MS 38502861,ChatGPT in medicine: prospects and challenges: a review article.,2024,"International journal of surgery (London, England)",,,,"It has been a year since the launch of Chat Generative Pre-trained Transformer (ChatGPT), a generative artificial intelligence (AI) program.
The introduction of this cross-generational product initially astonished people with its incredible potential and then aroused increasing concern. In the field of medicine, researchers have extensively explored the possible applications of ChatGPT and achieved numerous satisfactory results. However, opportunities and issues always come together. Problems have also been exposed during the application of ChatGPT, requiring cautious handling, thorough consideration, and further guidelines for safe use. Here, the authors summarized the potential applications of ChatGPT in the medical field, including revolutionizing healthcare consultation, assisting patient management and treatment, transforming medical education, and facilitating clinical research. Meanwhile, the authors also enumerated researchers' concerns arising alongside its broad and satisfactory applications. As AI will inevitably permeate every aspect of modern life, the authors hope that this review can not only promote people's understanding of the potential applications of ChatGPT in the future but also remind them to be more cautious about this ""Pandora's Box"" in the medical field. It is necessary to establish normative guidelines for its safe use in the medical field as soon as possible.",Tan S; Xin X; Wu D 37761715,Enhancing Kidney Transplant Care through the Integration of Chatbot.,2023,"Healthcare (Basel, Switzerland)",,,,"Kidney transplantation is a critical treatment option for end-stage kidney disease patients, offering improved quality of life and increased survival rates. However, the complexities of kidney transplant care necessitate continuous advancements in decision making, patient communication, and operational efficiency. This article explores the potential integration of a sophisticated chatbot, an AI-powered conversational agent, to enhance kidney transplant practice and potentially improve patient outcomes.
Chatbots and generative AI have shown promising applications in various domains, including healthcare, by simulating human-like interactions and generating contextually appropriate responses. Noteworthy AI models like ChatGPT by OpenAI, BingChat by Microsoft, and Bard AI by Google exhibit significant potential in supporting evidence-based research and healthcare decision making. The integration of chatbots in kidney transplant care may offer transformative possibilities. As a clinical decision support tool, it could provide healthcare professionals with real-time access to medical literature and guidelines, potentially enabling informed decision making and improved knowledge dissemination. Additionally, the chatbot has the potential to facilitate patient education by offering personalized and understandable information, addressing queries, and providing guidance on post-transplant care. Furthermore, under clinician or transplant pharmacist supervision, it has the potential to support post-transplant care and medication management by analyzing patient data, which may lead to tailored recommendations on dosages, monitoring schedules, and potential drug interactions. However, to fully ascertain its effectiveness and safety in these roles, further studies and validation are required. Its integration with existing clinical decision support systems may enhance risk stratification and treatment planning, contributing to more informed and efficient decision making in kidney transplant care. Given the importance of ethical considerations and bias mitigation in AI integration, future studies may evaluate long-term patient outcomes, cost-effectiveness, user experience, and the generalizability of chatbot recommendations. 
By addressing these factors and potentially leveraging AI capabilities, the integration of chatbots in kidney transplant care holds promise for potentially improving patient outcomes, enhancing decision making, and fostering the equitable and responsible use of AI in healthcare.",Garcia Valencia OA; Thongprayoon C; Jadlowiec CC; Mao SA; Miao J; Cheungpasitporn W 37900350,An approach for collaborative development of a federated biomedical knowledge graph-based question-answering system: Question-of-the-Month challenges.,2023,Journal of clinical and translational science,,,,"Knowledge graphs have become a common approach for knowledge representation. Yet, the application of graph methodology is elusive due to the sheer number and complexity of knowledge sources. In addition, semantic incompatibilities hinder efforts to harmonize and integrate across these diverse sources. As part of The Biomedical Translator Consortium, we have developed a knowledge graph-based question-answering system designed to augment human reasoning and accelerate translational scientific discovery: the Translator system. We have applied the Translator system to answer biomedical questions in the context of a broad array of diseases and syndromes, including Fanconi anemia, primary ciliary dyskinesia, multiple sclerosis, and others. A variety of collaborative approaches have been used to research and develop the Translator system. One recent approach involved the establishment of a monthly ""Question-of-the-Month (QotM) Challenge"" series. 
Herein, we describe the structure of the QotM Challenge; the six challenges that have been conducted to date on drug-induced liver injury, cannabidiol toxicity, coronavirus infection, diabetes, psoriatic arthritis, and ATP1A3-related phenotypes; the scientific insights that have been gleaned during the challenges; and the technical issues that were identified over the course of the challenges and that can now be addressed to foster further development of the prototype Translator system. We close with a discussion on Large Language Models such as ChatGPT and highlight differences between those models and the Translator system.",Fecho K; Bizon C; Issabekova T; Moxon S; Thessen AE; Abdollahi S; Baranzini SE; Belhu B; Byrd WE; Chung L; Crouse A; Duby MP; Ferguson S; Foksinska A; Forero L; Friedman J; Gardner V; Glusman G; Hadlock J; Hanspers K; Hinderer E; Hobbs C; Hyde G; Huang S; Koslicki D; Mease P; Muller S; Mungall CJ; Ramsey SA; Roach J; Rubin I; Schurman SH; Shalev A; Smith B; Soman K; Stemann S; Su AI; Ta C; Watkins PB; Williams MD; Wu C; Xu CH 40140500,Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics.,2025,Scientific reports,,,,"Large language model chatbots such as ChatGPT have shown potential for assisting health professionals in emergency departments (EDs). However, the diagnostic accuracy of newer ChatGPT models remains unclear. This retrospective study evaluated the diagnostic performance of various ChatGPT models-including GPT-3.5, GPT-4, GPT-4o, and o1 series-in predicting diagnoses for ED patients (n = 30) and examined the impact of explicitly invoking reasoning (thoughts). Earlier models, such as GPT-3.5, demonstrated high accuracy for top-three differential diagnoses (80.0% accuracy) but underperformed in identifying leading diagnoses (47.8%) compared to newer models such as chatgpt-4o-latest (60%, p < 0.01) and o1-preview (60%, p < 0.01).
Asking the models to provide their thoughts significantly enhanced performance in predicting the leading diagnosis for 4o models such as 4o-2024-0513 (from 45.6% to 56.7%; p = 0.03) and 4o-mini-2024-07-18 (from 54.4% to 60.0%; p = 0.04) but had minimal impact on o1-mini and o1-preview. In challenging cases, such as pneumonia without fever, all models generally failed to predict the correct diagnosis, indicating atypical presentations as a major limitation for ED application of current ChatGPT models.",Wang J; Shue K; Liu L; Hu G 40206998,Fine-Tuning Large Language Models for Specialized Use Cases.,2025,Mayo Clinic proceedings. Digital health,,,,"Large language models (LLMs) are a type of artificial intelligence that operates by predicting and assembling sequences of words that are statistically likely to follow from a given text input. With this basic ability, LLMs are able to answer complex questions and follow extremely complex instructions. Products created using LLMs such as ChatGPT by OpenAI and Claude by Anthropic have gained a huge amount of traction and user engagement and revolutionized the way we interact with technology, bringing a new dimension to human-computer interaction. Fine-tuning is a process in which a pretrained model, such as an LLM, is further trained on a custom data set to adapt it for specialized tasks or domains. In this review, we outline some of the major methodologic approaches and techniques that can be used to fine-tune LLMs for specialized use cases and enumerate the general steps required for carrying out LLM fine-tuning. We then illustrate a few of these methodologic approaches by describing several specific use cases of fine-tuning LLMs across medical subspecialties.
Finally, we close with a consideration of some of the benefits and limitations associated with fine-tuning LLMs for specialized use cases, with an emphasis on specific concerns in the field of medicine.",Anisuzzaman DM; Malins JG; Friedman PA; Attia ZI 37816837,The future landscape of large language models in medicine.,2023,Communications medicine,,,,"Large language models (LLMs) are artificial intelligence (AI) tools specifically trained to process and generate text. LLMs attracted substantial public attention after OpenAI's ChatGPT was made publicly available in November 2022. LLMs can often answer questions, summarize, paraphrase and translate text on a level that is nearly indistinguishable from human capabilities. The possibility to actively interact with models like ChatGPT makes LLMs attractive tools in various fields, including medicine. While these models have the potential to democratize medical knowledge and facilitate access to healthcare, they could equally distribute misinformation and exacerbate scientific misconduct due to a lack of accountability and transparency. In this article, we provide a systematic and comprehensive overview of the potentials and limitations of LLMs in clinical practice, medical research and medical education.",Clusmann J; Kolbinger FR; Muti HS; Carrero ZI; Eckardt JN; Laleh NG; Loffler CML; Schwarzkopf SC; Unger M; Veldhuizen GP; Wagner SJ; Kather JN 38312244,"Comparison of large language models in management advice for melanoma: Google's AI BARD, BingAI and ChatGPT.",2024,Skin health and disease,,,,"Large language models (LLMs) are emerging artificial intelligence (AI) technology refining research and healthcare. Their use in medicine has seen numerous recent applications. One area where LLMs have shown particular promise is in the provision of medical information and guidance to practitioners. 
This study aims to assess three prominent LLMs (Google's AI BARD, BingAI and ChatGPT-4) in providing management advice for melanoma by comparing their responses to current clinical guidelines and existing literature. Five questions on melanoma pathology were posed to the three LLMs. A panel of three experienced board-certified plastic surgeons evaluated the responses for readability (Flesch Reading Ease Score, Flesch-Kincaid Grade Level and Coleman-Liau Index) and suitability (modified DISCERN score), and compared them to existing guidelines. A t-test was performed to calculate differences in mean readability and reliability scores between LLMs, and a p value <0.05 was considered statistically significant. The mean readability scores of the three LLMs were comparable. ChatGPT performed best, with a Flesch Reading Ease Score of 35.42 (+/-21.02), a Flesch-Kincaid Grade Level of 11.98 (+/-4.49) and a Coleman-Liau Index of 12.00 (+/-5.10), although none of these differences was statistically significant (p > 0.05). In suitability, measured by the DISCERN score, ChatGPT (58 +/- 6.44) significantly (p = 0.04) outperformed BARD (36.2 +/- 34.06), while its difference from BingAI (49.8 +/- 22.28) was not significant. This study demonstrates that ChatGPT marginally outperforms BARD and BingAI in providing reliable, evidence-based clinical advice, but all three still face limitations in depth and specificity. Future research should improve LLM performance by integrating specialized databases and expert knowledge to support patient-centred care.",Mu X; Lim B; Seth I; Xie Y; Cevik J; Sofiadellis F; Hunter-Smith DJ; Rozen WM 39177261,Harnessing large language models' zero-shot and few-shot learning capabilities for regulatory research.,2024,Briefings in bioinformatics,,,,"Large language models (LLMs) are sophisticated AI-driven models trained on vast sources of natural language data. They are adept at generating responses that closely mimic human conversational patterns.
One of the most notable examples is OpenAI's ChatGPT, which has been extensively used across diverse sectors. Despite their flexibility, a significant challenge arises as most users must transmit their data to the servers of companies operating these models. Utilizing ChatGPT or similar models online may inadvertently expose sensitive information to the risk of data breaches. Therefore, implementing LLMs that are open source and smaller in scale within a secure local network becomes a crucial step for organizations where ensuring data privacy and protection has the highest priority, such as regulatory agencies. As a feasibility evaluation, we implemented a series of open-source LLMs within a regulatory agency's local network and assessed their performance on specific tasks involving extracting relevant clinical pharmacology information from regulatory drug labels. Our research shows that some models work well in the context of few- or zero-shot learning, achieving performance comparable, or even better than, neural network models that needed thousands of training samples. One of the models was selected to address a real-world issue of finding intrinsic factors that affect drugs' clinical exposure without any training or fine-tuning. In a dataset of over 700 000 sentences, the model showed a 78.5% accuracy rate. Our work pointed to the possibility of implementing open-source LLMs within a secure local network and using these models to perform various natural language processing tasks when large numbers of training examples are unavailable.",Meshkin H; Zirkle J; Arabidarrehdor G; Chaturbedi A; Chakravartula S; Mann J; Thrasher B; Li Z 39917061,Navigating the potential and pitfalls of large language models in patient-centered medication guidance and self-decision support.,2025,Frontiers in medicine,,,,"Large Language Models (LLMs) are transforming patient education in medication management by providing accessible information to support healthcare decision-making. 
Building on our recent scoping review of LLMs in patient education, this perspective examines their specific role in medication guidance. These artificial intelligence (AI)-driven tools can generate comprehensive responses about drug interactions, side effects, and emergency care protocols, potentially enhancing patient autonomy in medication decisions. However, significant challenges exist, including the risk of misinformation and the complexity of providing accurate drug information without access to individual patient data. Safety concerns are particularly acute when patients rely solely on AI-generated advice for self-medication decisions. This perspective analyzes current capabilities, examines critical limitations, and raises questions regarding the possible integration of LLMs in medication guidance. We emphasize the need for regulatory oversight to ensure these tools serve as supplements to, rather than replacements for, professional healthcare guidance.",Aydin S; Karabacak M; Vlachos V; Margetis K 40097720,Large language model agents can use tools to perform clinical calculations.,2025,NPJ digital medicine,,,,"Large language models (LLMs) can answer expert-level questions in medicine but are prone to hallucinations and arithmetic errors. Early evidence suggests LLMs cannot reliably perform clinical calculations, limiting their potential integration into clinical workflows. We evaluated ChatGPT's performance across 48 medical calculation tasks, finding incorrect responses in one-third of trials (n = 212). We then assessed three forms of agentic augmentation: retrieval-augmented generation, a code interpreter tool, and a set of task-specific calculation tools (OpenMedCalc) across 10,000 trials. Models with access to task-specific tools showed the greatest improvement, with LLaMa and GPT-based models demonstrating a 5.5-fold (88% vs 16%) and 13-fold (64% vs 4.8%) reduction in incorrect responses, respectively, compared to the unimproved models. 
Our findings suggest that integration of machine-readable, task-specific tools may help overcome LLMs' limitations in medical calculations.",Goodell AJ; Chu SN; Rouholiman D; Chu LF 37460753,Large language models in medicine.,2023,Nature medicine,,,,"Large language models (LLMs) can respond to free-text queries without being specifically trained in the task in question, causing excitement and concern about their use in healthcare settings. ChatGPT is a generative artificial intelligence (AI) chatbot produced through sophisticated fine-tuning of an LLM, and other tools are emerging through similar developmental processes. Here we outline how LLM applications such as ChatGPT are developed, and we discuss how they are being leveraged in clinical settings. We consider the strengths and limitations of LLMs and their potential to improve the efficiency and effectiveness of clinical, educational and research work in medicine. LLM chatbots have already been deployed in a range of biomedical contexts, with impressive but mixed results. This review acts as a primer for interested clinicians, who will determine if and how LLM technology is used in healthcare for the benefit of patients and practitioners.",Thirunavukarasu AJ; Ting DSJ; Elangovan K; Gutierrez L; Tan TF; Ting DSW 38428889,"Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank.",2024,Medicine,,,,"Large language models (LLMs) have been deployed in diverse fields, and the potential for their application in medicine has been explored through numerous studies. This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Emergency Medicine Board Examination question bank in the Korean language. Of the 2353 questions in the question bank, 150 questions were randomly selected, and 27 containing figures were excluded. 
Questions that required abilities such as analysis, creative thinking, evaluation, and synthesis were classified as higher-order questions, and those that required only recall, memory, and factual information in response were classified as lower-order questions. The answers and explanations obtained by inputting the 123 questions into the LLMs were analyzed and compared. ChatGPT-4 (75.6%) and Bing Chat (70.7%) showed higher correct response rates than ChatGPT-3.5 (56.9%) and Bard (51.2%). ChatGPT-4 showed the highest correct response rate for the higher-order questions at 76.5%, and Bard and Bing Chat showed the highest rate for the lower-order questions at 71.4%. The appropriateness of the explanation for the answer was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (75.6%, 68.3%, 52.8%, and 50.4%, respectively). ChatGPT-4 and Bing Chat outperformed ChatGPT-3.5 and Bard in answering a random selection of Emergency Medicine Board Examination questions in the Korean language.",Lee GU; Hong DY; Kim SY; Kim JW; Lee YH; Park SO; Lee KR 38098921,Stratified Evaluation of GPT's Question Answering in Surgery Reveals Artificial Intelligence (AI) Knowledge Gaps.,2023,Cureus,,,,"Large language models (LLMs) have broad potential applications in medicine, such as aiding with education, providing reassurance to patients, and supporting clinical decision-making. However, there is a notable gap in understanding their applicability and performance in the surgical domain and how their performance varies across specialties. This paper aims to evaluate the performance of LLMs in answering surgical questions relevant to clinical practice and to assess how this performance varies across different surgical specialties. We used the MedMCQA dataset, a large-scale multi-choice question-answer (MCQA) dataset consisting of clinical questions across all areas of medicine. 
We extracted the relevant 23,035 surgical questions and submitted them to the popular LLMs Generative Pre-trained Transformers (GPT)-3.5 and GPT-4 (OpenAI OpCo, LLC, San Francisco, CA). Generative Pre-trained Transformer is a large language model that can generate human-like text by predicting subsequent words in a sentence based on the context of the words that come before it. It is pre-trained on a diverse range of texts and can perform a variety of tasks, such as answering questions, without needing task-specific training. The question-answering accuracy of GPT was calculated and compared between the two models and across surgical specialties. Both GPT-3.5 and GPT-4 achieved accuracies of 53.3% and 64.4%, respectively, on surgical questions, showing a statistically significant difference in performance. When compared to their performance on the full MedMCQA dataset, the two models performed differently: GPT-4 performed worse on surgical questions than on the dataset as a whole, while GPT-3.5 showed the opposite pattern. Significant variations in accuracy were also observed across different surgical specialties, with strong performances in anatomy, vascular, and paediatric surgery and worse performances in orthopaedics, ENT, and neurosurgery. Large language models exhibit promising capabilities in addressing surgical questions, although the variability in their performance between specialties cannot be ignored. The lower performance of the latest GPT-4 model on surgical questions relative to questions across all medicine highlights the need for targeted improvements and continuous updates to ensure relevance and accuracy in surgical applications. 
Further research and continuous monitoring of LLM performance in surgical domains are crucial to fully harnessing their potential and mitigating the risks of misinformation.",Murphy Lonergan R; Curry J; Dhas K; Simmons BI 37378099,Embracing Large Language Models for Medical Applications: Opportunities and Challenges.,2023,Cureus,,,,"Large language models (LLMs) have the potential to revolutionize the field of medicine by, among other applications, improving diagnostic accuracy and supporting clinical decision-making. However, the successful integration of LLMs in medicine requires addressing challenges and considerations specific to the medical domain. This viewpoint article provides a comprehensive overview of key aspects for the successful implementation of LLMs in medicine, including transfer learning, domain-specific fine-tuning, domain adaptation, reinforcement learning with expert input, dynamic training, interdisciplinary collaboration, education and training, evaluation metrics, clinical validation, ethical considerations, data privacy, and regulatory frameworks. By adopting a multifaceted approach and fostering interdisciplinary collaboration, LLMs can be developed, validated, and integrated into medical practice responsibly, effectively, and ethically, addressing the needs of various medical disciplines and diverse patient populations. Ultimately, this approach will ensure that LLMs enhance patient care and improve overall health outcomes for all.",Karabacak M; Margetis K 39001657,Enhancing Care for Older Adults and Dementia Patients With Large Language Models: Proceedings of the National Institute on Aging-Artificial Intelligence & Technology Collaboratory for Aging Research Symposium.,2024,"The journals of gerontology. 
Series A, Biological sciences and medical sciences",,,,"Large Language Models (LLMs) stand on the brink of reshaping the field of aging and dementia care, challenging the one-size-fits-all paradigm with their capacity for precision medicine and individualized treatment strategies. The ""Large Pre-Trained Models with a Focus on AD/ADRD and Healthy Aging"" symposium, organized by the National Institute on Aging and the Johns Hopkins Artificial Intelligence & Technology Collaboratory for Aging Research, served as a platform for exploring this potential. The symposium brought together diverse experts to discuss the integration of LLMs in aging and dementia care. They highlighted the roles LLMs can play in clinical decision support and predictive analytics, while also addressing critical ethical concerns including bias, privacy, and the responsible use of artificial intelligence (AI). The discussions focused on the need to balance technological advancement with ethical considerations in AI deployment. In conclusion, the symposium projected a future where LLMs not only revolutionize healthcare practices but also pose significant challenges that require careful navigation.",Abadir PM; Battle A; Walston JD; Chellappa R 38103973,Evaluation of ChatGPT and Google Bard Using Prompt Engineering in Cancer Screening Algorithms.,2024,Academic radiology,,,,"Large language models (LLMs) such as ChatGPT and Bard have emerged as powerful tools in medicine, showcasing strong results in tasks such as radiology report translations and research paper drafting. While their implementation in clinical practice holds promise, their response accuracy remains variable. This study aimed to evaluate the accuracy of ChatGPT and Bard in clinical decision-making based on the American College of Radiology Appropriateness Criteria for various cancers. Both LLMs were evaluated in terms of their responses to open-ended (OE) and select-all-that-apply (SATA) prompts. 
Furthermore, the study incorporated prompt engineering (PE) techniques to enhance the accuracy of LLM outputs. The results revealed similar performances between ChatGPT and Bard on OE prompts, with ChatGPT exhibiting marginally higher accuracy in SATA scenarios. The introduction of PE also marginally improved LLM outputs in OE prompts but did not enhance SATA responses. The results highlight the potential of LLMs in aiding clinical decision-making processes, especially when guided by optimally engineered prompts. Future studies in diverse clinical situations are imperative to better understand the impact of LLMs in radiology.",Nguyen D; Swanson D; Newbury A; Kim YH 38726506,ChatGPT's performance in dentistry and allergy-immunology assessments: a comparative study.,2023,Swiss dental journal,,,,"Large language models (LLMs) such as ChatGPT have potential applications in healthcare, including dentistry. Priming, the practice of providing LLMs with initial, relevant information, is an approach to improve their output quality. This study aimed to evaluate the performance of ChatGPT 3 and ChatGPT 4 on self-assessment questions for dentistry, through the Swiss Federal Licensing Examination in Dental Medicine (SFLEDM), and allergy and clinical immunology, through the European Examination in Allergy and Clinical Immunology (EEAACI). The second objective was to assess the impact of priming on ChatGPT's performance. The SFLEDM and EEAACI multiple-choice questions from the University of Bern's Institute for Medical Education platform were administered to both ChatGPT versions, with and without priming. Performance was analyzed based on correct responses. The statistical analysis included Wilcoxon rank sum tests (alpha=0.05). The average accuracy rates in the SFLEDM and EEAACI assessments were 63.3% and 79.3%, respectively. Both ChatGPT versions performed better on EEAACI than SFLEDM, with ChatGPT 4 outperforming ChatGPT 3 across all tests.
ChatGPT 3's performance exhibited a significant improvement with priming for both EEAACI (p=0.017) and SFLEDM (p=0.024) assessments. For ChatGPT 4, the priming effect was significant only in the SFLEDM assessment (p=0.038). The performance disparity between SFLEDM and EEAACI assessments underscores ChatGPT's varying proficiency across different medical domains, likely tied to the nature and amount of training data available in each field. Priming can be a tool for enhancing output, especially in earlier LLMs. Advancements from ChatGPT 3 to 4 highlight the rapid developments in LLM technology. Yet, their use in critical fields such as healthcare must remain cautious owing to LLMs' inherent limitations and risks.",Fuchs A; Trachsel T; Weiger R; Eggmann F 37799027,ChatGPT's performance in dentistry and allergy-immunology assessments: a comparative study.,2023,Swiss dental journal,,,,"Large language models (LLMs) such as ChatGPT have potential applications in healthcare, including dentistry. Priming, the practice of providing LLMs with initial, relevant information, is an approach to improve their output quality. This study aimed to evaluate the performance of ChatGPT 3 and ChatGPT 4 on self-assessment questions for dentistry, through the Swiss Federal Licensing Examination in Dental Medicine (SFLEDM), and allergy and clinical immunology, through the European Examination in Allergy and Clinical Immunology (EEAACI). The second objective was to assess the impact of priming on ChatGPT's performance. The SFLEDM and EEAACI multiple-choice questions from the University of Bern's Institute for Medical Education platform were administered to both ChatGPT versions, with and without priming. Performance was analyzed based on correct responses. The statistical analysis included Wilcoxon rank sum tests (alpha=0.05). The average accuracy rates in the SFLEDM and EEAACI assessments were 63.3% and 79.3%, respectively. 
Both ChatGPT versions performed better on EEAACI than SFLEDM, with ChatGPT 4 outperforming ChatGPT 3 across all tests. ChatGPT 3's performance exhibited a significant improvement with priming for both EEAACI (p=0.017) and SFLEDM (p=0.024) assessments. For ChatGPT 4, the priming effect was significant only in the SFLEDM assessment (p=0.038). The performance disparity between SFLEDM and EEAACI assessments underscores ChatGPT's varying proficiency across different medical domains, likely tied to the nature and amount of training data available in each field. Priming can be a tool for enhancing output, especially in earlier LLMs. Advancements from ChatGPT 3 to 4 highlight the rapid developments in LLM technology. Yet, their use in critical fields such as healthcare must remain cautious owing to LLMs' inherent limitations and risks.",Fuchs A; Trachsel T; Weiger R; Eggmann F 37855948,Examining the Potential of ChatGPT on Biomedical Information Retrieval: Fact-Checking Drug-Disease Associations.,2024,Annals of biomedical engineering,,,,"Large language models (LLMs) such as ChatGPT have recently attracted significant attention due to their impressive performance on many real-world tasks. These models have also demonstrated potential in facilitating various biomedical tasks. However, little is known of their potential in biomedical information retrieval, especially identifying drug-disease associations. This study aims to explore the potential of ChatGPT, a popular LLM, in discerning drug-disease associations. We collected 2694 true drug-disease associations and 5662 false drug-disease pairs. Our approach involved creating various prompts to instruct ChatGPT in identifying these associations. Under varying prompt designs, ChatGPT identified drug-disease associations with an accuracy of 74.6-83.5% for the true pairs and 96.2-97.6% for the false pairs.
This study shows that ChatGPT has the potential in identifying drug-disease associations and may serve as a helpful tool in searching pharmacy-related information. However, the accuracy of its insights warrants comprehensive examination before its implementation in medical practice.",Gao Z; Li L; Ma S; Wang Q; Hemphill L; Xu R 39722188,Application of large language models in disease diagnosis and treatment.,2025,Chinese medical journal,,,,"Large language models (LLMs) such as ChatGPT, Claude, Llama, and Qwen are emerging as transformative technologies for the diagnosis and treatment of various diseases. With their exceptional long-context reasoning capabilities, LLMs are proficient in clinically relevant tasks, particularly in medical text analysis and interactive dialogue. They can enhance diagnostic accuracy by processing vast amounts of patient data and medical literature and have demonstrated their utility in diagnosing common diseases and facilitating the identification of rare diseases by recognizing subtle patterns in symptoms and test results. Building on their image-recognition abilities, multimodal LLMs (MLLMs) show promising potential for diagnosis based on radiography, chest computed tomography (CT), electrocardiography (ECG), and common pathological images. These models can also assist in treatment planning by suggesting evidence-based interventions and improving clinical decision support systems through integrated analysis of patient records. Despite these promising developments, significant challenges persist regarding the use of LLMs in medicine, including concerns regarding algorithmic bias, the potential for hallucinations, and the need for rigorous clinical validation. Ethical considerations also underscore the importance of maintaining the function of supervision in clinical practice. 
This paper highlights the rapid advancements in research on the diagnostic and therapeutic applications of LLMs across different medical disciplines and emphasizes the importance of policymaking, ethical supervision, and multidisciplinary collaboration in promoting more effective and safer clinical applications of LLMs. Future directions include the integration of proprietary clinical knowledge, the investigation of open-source and customized models, and the evaluation of real-time effects in clinical diagnosis and treatment practices.",Yang X; Li T; Su Q; Liu Y; Kang C; Lyu Y; Zhao L; Nie Y; Pan Y 38801706,Evidence-Based Learning Strategies in Medicine Using AI.,2024,JMIR medical education,,,,"Large language models (LLMs), like ChatGPT, are transforming the landscape of medical education. They offer a vast range of applications, such as tutoring (personalized learning), patient simulation, generation of examination questions, and streamlined access to information. The rapid advancement of medical knowledge and the need for personalized learning underscore the relevance and timeliness of exploring innovative strategies for integrating artificial intelligence (AI) into medical education. In this paper, we propose coupling evidence-based learning strategies, such as active recall and memory cues, with AI to optimize learning. These strategies include the generation of tests, mnemonics, and visual cues.",Arango-Ibanez JP; Posso-Nunez JA; Diaz-Solorzano JP; Cruz-Suarez G 37501529,"Application of ChatGPT in Routine Diagnostic Pathology: Promises, Pitfalls, and Potential Future Directions.",2024,Advances in anatomic pathology,,,,"Large Language Models are forms of artificial intelligence that use deep learning algorithms to decipher large amounts of text and exhibit strong capabilities like question answering and translation. 
Recently, an influx of Large Language Models has emerged in the medical and academic discussion, given their potential widespread application to improve patient care and provider workflow. One application that has gained notable recognition in the literature is ChatGPT, which is a natural language processing ""chatbot"" technology developed by the artificial intelligence development software company OpenAI. It learns from large amounts of text data to generate automated responses to inquiries in seconds. In health care and academia, chatbot systems like ChatGPT have gained much recognition recently, given their potential to become functional, reliable virtual assistants. However, much research is required to determine the accuracy, validity, and ethical concerns of the integration of ChatGPT and other chatbots into everyday practice. One such field where little information and research on the matter currently exists is pathology. Herein, we present a literature review of pertinent articles regarding the current status and understanding of ChatGPT and its potential application in routine diagnostic pathology. In this review, we address the promises, possible pitfalls, and future potential of this application. We provide examples of actual conversations conducted with the chatbot technology that mimic hypothetical but practical diagnostic pathology scenarios that may be encountered in routine clinical practice. On the basis of this experience, we observe that ChatGPT and other chatbots already have a remarkable ability to distill and summarize, within seconds, vast amounts of publicly available data and information to assist in laying a foundation of knowledge on a specific topic. We emphasize that, at this time, any use of such knowledge at the patient care level in clinical medicine must be carefully vetted through established sources of medical information and expertise. 
We suggest and anticipate that with the ever-expanding knowledge base required to reliably practice personalized, precision anatomic pathology, improved technologies like future versions of ChatGPT (and other chatbots) enabled by expanded access to reliable, diverse data, might serve as a key ally to the diagnostician. Such technology has real potential to further empower the time-honored paradigm of histopathologic diagnoses based on the integrative cognitive assessment of clinical, gross, and microscopic findings and ancillary immunohistochemical and molecular studies at a time of exploding biomedical knowledge.",Schukow C; Smith SC; Landgrebe E; Parasuraman S; Folaranmi OO; Paner GP; Amin MB 36589923,Rapamycin in the context of Pascal's Wager: generative pre-trained transformer perspective.,2022,Oncoscience,,,,"Large language models utilizing transformer neural networks and other deep learning architectures demonstrated unprecedented results in many tasks previously accessible only to human intelligence. In this article, we collaborate with ChatGPT, an AI model developed by OpenAI to speculate on the applications of Rapamycin, in the context of Pascal's Wager philosophical argument commonly utilized to justify the belief in god. In response to the query ""Write an exhaustive research perspective on why taking Rapamycin may be more beneficial than not taking Rapamycin from the perspective of Pascal's wager"" ChatGPT provided the pros and cons for the use of Rapamycin considering the preclinical evidence of potential life extension in animals. 
This article demonstrates the potential of ChatGPT to produce complex philosophical arguments and should not be used for any off-label use of Rapamycin.",Zhavoronkov A 38093584,The use of large language models in medicine: proceeding with caution.,2024,Current medical research and opinion,,,,"Large language models, like ChatGPT and Bard, have potential clinical applications due to their ability to generate conversational responses and encode medical knowledge. However, their clinical adoption faces challenges including hallucinations, lack of transparency, and lack of consistency. Ethicolegal concerns surrounding patient consent, legal liability, and data privacy further complicate matters. Despite their promise, an optimistic but cautious approach is essential for the safe integration of large language models into clinical settings.",Deng J; Zubair A; Park YJ; Affan E; Zuo QK 37286844,Harnessing the Potential of ChatGPT in Breast Reconstruction: A Revolution in Patient Communication and Education.,2023,Aesthetic plastic surgery,,,,"Level of Evidence IV This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Lanzano G 39470819,"Comment on Evaluation of Rhinoplasty Information from ChatGPT, Gemini, and Claude.",2024,Aesthetic plastic surgery,,,,"Level of Evidence V This journal requires that authors assign a level of evidence to each article. 
For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Kleebayoon A; Wiwanitkit V 39330905,Enhancing the Interpretability of Malaria and Typhoid Diagnosis with Explainable AI and Large Language Models.,2024,Tropical medicine and infectious disease,,,,"Malaria and Typhoid fever are prevalent diseases in tropical regions, and both are exacerbated by unclear protocols, drug resistance, and environmental factors. Prompt and accurate diagnosis is crucial to improve accessibility and reduce mortality rates. Traditional diagnosis methods cannot effectively capture the complexities of these diseases due to the presence of similar symptoms. Although machine learning (ML) models offer accurate predictions, they operate as ""black boxes"" with non-interpretable decision-making processes, making it challenging for healthcare providers to comprehend how the conclusions are reached. This study employs explainable AI (XAI) models such as Local Interpretable Model-agnostic Explanations (LIME), and Large Language Models (LLMs) like GPT to clarify diagnostic results for healthcare workers, building trust and transparency in medical diagnostics by describing which symptoms had the greatest impact on the model's decisions and providing clear, understandable explanations. The models were implemented on Google Colab and Visual Studio Code because of their rich libraries and extensions. Results showed that the Random Forest model outperformed the other tested models; in addition, important features were identified with the LIME plots while ChatGPT 3.5 had a comparative advantage over other LLMs. The study integrates RF, LIME, and GPT in building a mobile app to enhance the interpretability and transparency in malaria and typhoid diagnosis system. Despite its promising results, the system's performance is constrained by the quality of the dataset. 
Additionally, while LIME and GPT improve transparency, they may introduce complexities in real-time deployment due to computational demands and the need for internet service to maintain relevance and accuracy. The findings suggest that AI-driven diagnostic systems can significantly enhance healthcare delivery in environments with limited resources, and future works can explore the applicability of this framework to other medical conditions and datasets.",Attai K; Ekpenyong M; Amannah C; Asuquo D; Ajuga P; Obot O; Johnson E; John A; Maduka O; Akwaowo C; Uzoka FM 37332004,Success Through Simplicity: What Other Artificial Intelligence Applications in Medicine Should Learn from History and ChatGPT.,2023,Annals of biomedical engineering,,,,"Many artificial intelligence (AI) algorithms have been developed for medical practice, but few have led to clinically used products. The recent hype of ChatGPT shows us that simple, user-friendly interfaces are one major factor in the applications' popularity. The majority of AI-based applications in clinical practice are still far from simple-to-use applications with user-friendly interfaces. Therefore, simplifying operations is one key to AI-based medical applications' success.",Sedaghat S 38737711,How AI drives innovation in cardiovascular medicine.,2024,Frontiers in cardiovascular medicine,,,,"Medicine is entering a new era in which artificial intelligence (AI) and deep learning have a measurable impact on patient care. This impact is especially evident in cardiovascular medicine. 
While the purpose of this short opinion paper is not to provide an in-depth review of the many applications of AI in cardiovascular medicine, we summarize some of the important advances that have taken place in this domain.",Cerrato PL; Halamka JD 37082496,Neuroblastoma Masquerading as a Septic Hip Infection in a Three-Year-Old.,2023,Cureus,,,,"Metastatic neuroblastoma to the bone and septic joint shares the same incidence in age and clinical symptomology. Here we discuss a three-year-old male who presented with anemia, persistent hip pain, and a refusal to bear weight. A thorough evaluation based on a broad differential diagnosis allowed for an expedient diagnosis of metastatic neuroblastoma. The timely diagnosis allowed for rapid enrolment in a children's oncology group (COG) clinical trial for advanced neuroblastoma. The patient tolerated the therapy without adverse events and remains in remission.",Lynch JD; Tomboc PJ 37077800,From human writing to artificial intelligence generated text: examining the prospects and potential threats of ChatGPT in academic writing.,2023,Biology of sport,,,,"Natural language processing (NLP) has been studied in computing for decades. Recent technological advancements have led to the development of sophisticated artificial intelligence (AI) models, such as Chat Generative Pre-trained Transformer (ChatGPT). These models can perform a range of language tasks and generate human-like responses, which offers exciting prospects for academic efficiency. This manuscript aims at (i) exploring the potential benefits and threats of ChatGPT and other NLP technologies in academic writing and research publications; (ii) highlights the ethical considerations involved in using these tools, and (iii) consider the impact they may have on the authenticity and credibility of academic work. This study involved a literature review of relevant scholarly articles published in peer-reviewed journals indexed in Scopus as quartile 1. 
The search used keywords such as ""ChatGPT,"" ""AI-generated text,"" ""academic writing,"" and ""natural language processing."" The analysis was carried out using a quasi-qualitative approach, which involved reading and critically evaluating the sources and identifying relevant data to support the research questions. The study found that ChatGPT and other NLP technologies have the potential to enhance academic writing and research efficiency. However, their use also raises concerns about the impact on the authenticity and credibility of academic work. The study highlights the need for comprehensive discussions on the potential use, threats, and limitations of these tools, emphasizing the importance of ethical and academic principles, with human intelligence and critical thinking at the forefront of the research process. This study highlights the need for comprehensive debates and ethical considerations involved in their use. The study also recommends that academics exercise caution when using these tools and ensure transparency in their use, emphasizing the importance of human intelligence and critical thinking in academic work.",Dergaa I; Chamari K; Zmijewski P; Ben Saad H 39059557,Natural Language Processing in medicine and ophthalmology: A review for the 21st-century clinician.,2024,"Asia-Pacific journal of ophthalmology (Philadelphia, Pa.)",,,,"Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and human language, enabling computers to understand, generate, and derive meaning from human language. NLP's potential applications in the medical field are extensive and vary from extracting data from Electronic Health Records -one of its most well-known and frequently exploited uses- to investigating relationships among genetics, biomarkers, drugs, and diseases for the proposal of new medications. NLP can be useful for clinical decision support, patient monitoring, or medical image analysis. 
Despite its vast potential, the real-world application of NLP is still limited due to various challenges and constraints, meaning that its evolution predominantly continues within the research domain. However, with the increasingly widespread use of NLP, particularly with the availability of large language models, such as ChatGPT, it is crucial for medical professionals to be aware of the status, uses, and limitations of these technologies.",Rojas-Carabali W; Agrawal R; Gutierrez-Sinisterra L; Baxter SL; Cifuentes-Gonzalez C; Wei YC; Abisheganaden J; Kannapiran P; Wong S; Lee B; de-la-Torre A; Agrawal R 36924907,The exciting potential for ChatGPT in obstetrics and gynecology.,2023,American journal of obstetrics and gynecology,,,,"Natural language processing-the branch of artificial intelligence concerned with the interaction between computers and human language-has advanced markedly in recent years with the introduction of sophisticated deep-learning models. Improved performance in natural language processing tasks, such as text and speech processing, have fueled impressive demonstrations of these models' capabilities. Perhaps no demonstration has been more impactful to date than the introduction of the publicly available online chatbot ChatGPT in November 2022 by OpenAI, which is based on a natural language processing model known as a Generative Pretrained Transformer. Through a series of questions posed by the authors about obstetrics and gynecology to ChatGPT as prompts, we evaluated the model's ability to handle clinical-related queries. Its answers demonstrated that in its current form, ChatGPT can be valuable for users who want preliminary information about virtually any topic in the field. Because its educational role is still being defined, we must recognize its limitations. Although answers were generally eloquent, informed, and lacked a significant degree of mistakes or misinformation, we also observed evidence of its weaknesses. 
A significant drawback is that the data on which the model has been trained are apparently not readily updated. The specific model assessed here seems not to reliably (if at all) source data from after 2021. Users of ChatGPT who expect data to be more up to date need to be aware of this drawback. An inability to cite sources or to truly understand what the user is asking suggests that it has the capability to mislead. Responsible use of models like ChatGPT will be important for ensuring that they work to help but not harm users seeking information on obstetrics and gynecology.",Grunebaum A; Chervenak J; Pollet SL; Katz A; Chervenak FA 39634994,Evaluating the Performance of ChatGPT in the Prescribing Safety Assessment: Implications for Artificial Intelligence-Assisted Prescribing.,2024,Cureus,,,,"Objective With the rapid advancement of artificial intelligence (AI) technologies, models like Chat Generative Pre-Trained Transformer (ChatGPT) are increasingly being evaluated for their potential applications in healthcare. The Prescribing Safety Assessment (PSA) is a standardised test for junior physicians in the UK to evaluate prescribing competence. This study aims to assess ChatGPT's ability to pass the PSA and its performance across different exam sections. Methodology ChatGPT (version GPT-4) was tested on four official PSA practice papers, each containing 30 questions, in three independent trials per paper, with answers evaluated using official PSA mark schemes. Performance was measured by calculating overall percentage scores and comparing them to the pass marks provided for each practice paper. Subsection performance was also analysed to identify strengths and weaknesses. Results ChatGPT achieved mean scores of 257/300 (85.67%), 236/300 (78.67%), 199/300 (66.33%), and 233/300 (77.67%) across the four papers, consistently surpassing the pass marks where available.
ChatGPT performed well in sections requiring factual recall, such as ""Adverse Drug Reactions"", scoring 63/72 (87.50%), and ""Communicating Information"", scoring 63/72 (88.89%). However, it struggled in ""Data Interpretation"", scoring 32/72 (44.44%), showing variability across trials and indicating limitations in handling more complex clinical reasoning tasks. Conclusion While ChatGPT demonstrated strong potential in passing the PSA and excelling in sections requiring factual knowledge, its limitations in data interpretation highlight the current gaps in AI's ability to fully replicate human clinical judgement. ChatGPT shows promise in supporting safe prescribing, particularly in areas prone to human error, such as drug interactions and communicating correct information. However, due to its variability in more complex reasoning tasks, ChatGPT is not yet ready to replace human prescribers and should instead serve as a supplemental tool in clinical practice.",Bull D; Okaygoun D 38593984,Identifying ChatGPT-Written Patient Education Materials Using Text Analysis and Readability.,2024,American journal of perinatology,,,,"OBJECTIVE: Artificial intelligence (AI)-based text generators such as Chat Generative Pre-Trained Transformer (ChatGPT) have come into the forefront of modern medicine. Given the similarity between AI-generated and human-composed text, tools need to be developed to quickly differentiate the two. Previous work has shown that simple grammatical analysis can reliably differentiate AI-generated text from human-written text. STUDY DESIGN: In this study, ChatGPT was used to generate 25 articles related to obstetric topics similar to those made by the American College of Obstetrics and Gynecology (ACOG). All articles were geared towards patient education. These AI-generated articles were then analyzed for their readability and grammar using validated scoring systems and compared to real articles from ACOG. 
RESULTS: Characteristics of the 25 AI-generated articles included fewer overall characters than original articles (mean 3,066 vs. 7,426; p < 0.0001), a greater average word length (mean 5.3 vs. 4.8; p < 0.0001), and a lower Flesch-Kincaid score (mean 46 vs. 59; p < 0.0001). With this knowledge, a new scoring system was developed to score articles based on their Flesch-Kincaid readability score, number of total characters, and average word length. This novel scoring system was tested on 17 new AI-generated articles related to obstetrics and 7 articles from ACOG, and was able to differentiate between AI-generated articles and human-written articles with a sensitivity of 94.1% and specificity of 100% (Area Under the Curve [AUC] 0.99). CONCLUSION: As ChatGPT is more widely integrated into medicine, it will be important for health care stakeholders to have tools to separate originally written documents from those generated by AI. While more robust analyses may be required to determine the authenticity of articles written by complex AI technology in the future, simple grammatical analysis can accurately characterize current AI-generated texts with a high degree of sensitivity and specificity. KEY POINTS: More tools are needed to identify AI-generated text in obstetrics, for both doctors and patients. Grammatical analysis is quick and easily done. Grammatical analysis is a feasible and accurate way to identify AI-generated text.",Monje S; Ulene S; Gimovsky AC 38599321,Chatbots vs andrologists: Testing 25 clinical cases.,2024,The French journal of urology,,,,"OBJECTIVE: AI-derived language models are booming, and their place in medicine is undefined. The aim of our study is to compare responses to andrology clinical cases between chatbots and andrologists, to assess the reliability of these technologies. MATERIAL AND METHOD: We analyzed the responses of 32 experts, 18 residents and three chatbots (ChatGPT v3.5, v4 and Bard) to 25 andrology clinical cases.
Responses were assessed on a Likert scale ranging from 0 to 2 for each question (0: false response or no response; 1: partially correct response; 2: correct response), on the basis of the latest national or, in the absence of such, international recommendations. We compared the averages obtained for all cases by the different groups. RESULTS: Experts obtained a higher mean score (m=11.0/12.4 sigma=1.4) than ChatGPT v4 (m=10.7/12.4 sigma=2.2, p=0.6475), ChatGPT v3.5 (m=9.5/12.4 sigma=2.1, p=0.0062) and Bard (m=7.2/12.4 sigma=3.3, p<0.0001). Residents obtained a mean score (m=9.4/12.4 sigma=1.7) higher than Bard (m=7.2/12.4 sigma=3.3, p=0.0053) but lower than ChatGPT v3.5 (m=9.5/12.4 sigma=2.1, p=0.8393) and v4 (m=10.7/12.4 sigma=2.2, p=0.0183) and experts (m=11.0/12.4 sigma=1.4, p=0.0009). ChatGPT v4 performance (m=10.7 sigma=2.2) was better than ChatGPT v3.5 (m=9.5 sigma=2.1, p=0.0476) and Bard performance (m=7.2 sigma=3.3, p<0.0001). CONCLUSION: The use of chatbots in medicine could be relevant. More studies are needed to integrate them into clinical practice.",Perrot O; Schirmann A; Vidart A; Guillot-Tantay C; Izard V; Lebret T; Boillot B; Mesnard B; Lebacle C; Madec FX 39715109,Comparison of the experience and perception of artificial intelligence among practicing doctors and medical students.,2024,"Wiadomosci lekarskie (Warsaw, Poland : 1960)",,,,"OBJECTIVE: To analyze and compare the experiences and perceptions of artificial intelligence (AI) among practicing doctors and medical students. PATIENTS AND METHODS: A survey was conducted among 30 doctors and 30 fifth-year master's students enrolled in the ""Medicine"" program. Participants were asked about their experiences with AI, their perceptions of AI's impact on their education and practice, and their views on the benefits and drawbacks of AI in the medical field. The data were analyzed to compare the responses between the two groups.
RESULTS: Among the respondents, 8 doctors (26.67%) and 4 students (13.33%) had not used AI in their practice or studies. The analysis was conducted on the remaining 22 doctors and 26 students. The study found that students generally rated the effectiveness of AI higher than physicians did, particularly in areas such as enhancing work and educational experiences. Both groups used AI primarily for information retrieval, with students showing a slightly greater openness to expanding AI's role in education and practice. Despite recognizing the advantages of AI, both groups expressed concerns regarding its accuracy and reliability. CONCLUSION: The study indicates that while AI, particularly ChatGPT, is increasingly being adopted in medical education and practice, there is still a level of caution and skepticism among both students and professionals. Further research is needed to optimize the integration of AI in medical curricula and address the ethical implications of its use.",Drevitska OO; Butska LV; Drevytskyi OO; Ryzhak VO; Varina HB; Kovalova OV; Medvid IV 38374694,Neurological Diagnosis: Artificial Intelligence Compared With Diagnostic Generator.,2024,The neurologist,,,,"OBJECTIVE: Artificial intelligence has recently become available for widespread use in medicine, including the interpretation of digitized information, big data for tracking disease trends and patterns, and clinical diagnosis. Comparative studies and expert opinion support the validity of imaging and data analysis, yet similar validation is lacking in clinical diagnosis. Artificial intelligence programs are here compared with a diagnostic generator program in clinical neurology.
METHODS: Using 4 nonrandomly selected case records from New England Journal of Medicine clinicopathologic conferences from 2017 to 2022, 2 artificial intelligence programs (ChatGPT-4 and GLASS AI) were compared with a neurological diagnostic generator program (NeurologicDx.com) for diagnostic capability and accuracy and source authentication. RESULTS: Compared with NeurologicDx.com, the 2 AI programs showed results varying with order of key term entry and with repeat querying. The diagnostic generator yielded more differential diagnostic entities, with correct diagnoses in 4 of 4 test cases versus 0 of 4 for ChatGPT-4 and 1 of 4 for GLASS AI, respectively, and with authentication of diagnostic entities compared with the AI programs. CONCLUSIONS: The diagnostic generator NeurologicDx yielded a more robust and reproducible differential diagnostic list with higher diagnostic accuracy and associated authentication compared with artificial intelligence programs.",Finelli PF 40060265,"Emergency Medicine Assistants in the Field of Toxicology, Comparison of ChatGPT-3.5 and GEMINI Artificial Intelligence Systems.",2024,Acta medica Lituanica,,,,"OBJECTIVE: Artificial intelligence models human thinking and problem-solving abilities, allowing computers to make autonomous decisions. There is a lack of studies demonstrating the clinical utility of GPT and Gemini in the field of toxicology, which means their level of competence is not well understood. This study compares the responses given by GPT-3.5 and Gemini to those provided by emergency medicine residents. METHODS: This prospective study was focused on toxicology and utilized the widely recognized educational resource 'Tintinalli Emergency Medicine: A Comprehensive Study Guide' for the field of Emergency Medicine. A set of twenty questions, each with five options, was devised to test knowledge of toxicological data as defined in the book.
These questions were then posed to ChatGPT GPT-3.5 (Generative Pre-trained Transformer 3.5) by OpenAI and Gemini by Google AI. The resulting answers were then meticulously analyzed. RESULTS: 28 physicians, 35.7% of whom were women, were included in our study. A comparison was made between the physician and AI scores. While a significant difference was found in the comparison (F=2.368 and p<0.001), no significant difference was found between the two groups in the post-hoc Tukey test. The GPT-3.5 mean score was 9.9+/-0.71, the Gemini mean score was 11.30+/-1.17, and the physicians' mean score was 9.82+/-3.70 (Figure 1). CONCLUSIONS: It is clear that GPT-3.5 and Gemini respond similarly to topics in toxicology, just as resident physicians do.",Bedel HA; Bedel C; Selvi F; Zortuk O; Karanci Y 39285054,"Evaluation of Rhinoplasty Information from ChatGPT, Gemini, and Claude for Readability and Accuracy.",2025,Aesthetic plastic surgery,,,,"OBJECTIVE: Assessment of the readability, accuracy, quality, and completeness of ChatGPT (Open AI, San Francisco, CA), Gemini (Google, Mountain View, CA), and Claude (Anthropic, San Francisco, CA) responses to common questions about rhinoplasty. METHODS: Ten questions commonly encountered in the senior author's (SPM) rhinoplasty practice were presented to ChatGPT-4, Gemini and Claude. Seven Facial Plastic and Reconstructive Surgeons with experience in rhinoplasty were asked to evaluate these responses for accuracy, quality, completeness, relevance, and use of medical jargon on a Likert scale. The responses were also evaluated using several readability indices. RESULTS: ChatGPT achieved significantly higher evaluator scores for accuracy and overall quality but scored significantly lower on completeness compared to Gemini and Claude. All three chatbot responses to the ten questions were rated as neutral to incomplete. All three chatbots were found to use medical jargon and scored at a college reading level for readability scores.
CONCLUSIONS: Rhinoplasty surgeons should be aware that the medical information found on chatbot platforms is incomplete and still needs to be scrutinized for accuracy. However, the technology does have potential for use in healthcare education by training it on evidence-based recommendations and improving readability. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Meyer MKR; Kandathil CK; Davis SJ; Durairaj KK; Patel PN; Pepper JP; Spataro EA; Most SP 38511678,Is ChatGPT an Accurate and Reliable Source of Information for Patients with Vaccine and Statin Hesitancy?,2024,Medeniyet medical journal,,,,"OBJECTIVE: Chat Generative Pre-trained Transformer (ChatGPT) is an artificial intelligence (AI) language model that is trained to respond to questions across a wide range of topics. Our aim is to elucidate whether it would be beneficial for patients who are hesitant about vaccines and statins to use ChatGPT. METHODS: This cross-sectional and observational study was conducted from March 2 to March 30, 2023, using OpenAI ChatGPT-3.5. ChatGPT provided responses to 7 questions related to vaccine and statin hesitancy. The same questions were also directed at physicians. Both the answers from ChatGPT and the physicians were assessed for accuracy, clarity, and conciseness by experts in cardiology, internal medicine, and microbiology, who possessed a minimum of 30 years of professional experience. Responses were rated on a scale of 0-4, and the ChatGPT's average score was compared with that of physicians using the Mann-Whitney U test. RESULTS: The mean scores of ChatGPT (3.78+/-0.36) and physicians (3.65+/-0.57) were similar (Mann-Whitney U test p=0.33). The mean scores of ChatGPT were 3.85+/-0.34 for vaccination and 3.68+/-0.35 for statin use. 
The mean scores of physicians were 3.73+/-0.51 for vaccination and 3.58+/-0.61 for statin use. There was no statistically significant difference between the mean scores of ChatGPT and physicians for both vaccine and statin use (p=0.403 for vaccination, p=0.678 for statin). ChatGPT did not consider sources of conspiratorial information on vaccines and statins. CONCLUSIONS: This study suggests that ChatGPT can be a valuable source of information for guiding patients with vaccine and statin hesitancy.",Torun C; Sarmis A; Oguz A 38084123,Are Different Versions of ChatGPT's Ability Comparable to the Clinical Diagnosis Presented in Case Reports? A Descriptive Study.,2023,Journal of multidisciplinary healthcare,,,,"OBJECTIVE: ChatGPT, an advanced language model developed by OpenAI, holds the opportunity to bring about a transformation in the processing of clinical decision-making within the realm of medicine. Despite the growing popularity of research related on ChatGPT, there is a paucity of research assessing its appropriateness for clinical decision support. Our study delved into ChatGPT's ability to respond in accordance with the diagnoses found in case reports, with the intention of serving as a reference for clinical decision-making. METHODS: We included 147 case reports from the Chinese Medical Association Journal Database that generated primary and secondary diagnoses covering various diseases. Each question was independently posed three times to both GPT-3.5 and GPT-4.0, respectively. The results were analyzed regarding ChatGPT's mean scores and accuracy types. RESULTS: GPT-4.0 displayed moderate accuracy in primary diagnoses. With the increasing number of input, a corresponding enhancement in the accuracy of ChatGPT's outputs became evident. Notably, autoimmune diseases comprised the largest proportion of case reports, and the mean score for primary diagnosis exhibited statistically significant differences in autoimmune diseases. 
CONCLUSION: Our findings suggest the potential practicality of utilizing ChatGPT for clinical decision-making. To enhance the accuracy of ChatGPT, it is necessary to integrate it with the existing electronic health record system in the future.",Chen J; Liu L; Ruan S; Li M; Yin C 39232303,Bias Perpetuates Bias: ChatGPT Learns Gender Inequities in Academic Surgery Promotions.,2024,Journal of surgical education,,,,"OBJECTIVE: Gender inequities persist in academic surgery with implicit bias impacting hiring and promotion at all levels. We hypothesized that creating letters of recommendation for both female and male candidates for academic promotion in surgery using an AI platform, ChatGPT, would elucidate the entrained gender biases already present in the promotion process. DESIGN: Using ChatGPT, we generated 6 letters of recommendation for ""a phenomenal surgeon applying for job promotion to associate professor position"", specifying ""female"" or ""male"" before surgeon in the prompt. We compared 3 ""female"" letters to 3 ""male"" letters for differences in length, language, and tone. RESULTS: The letters written for females averaged 298 words compared to 314 for males. Female letters more frequently referred to ""compassion"", ""empathy"", and ""inclusivity""; whereas male letters referred to ""respect"", ""reputation"", and ""skill"". CONCLUSIONS: These findings highlight the gender bias present in promotion letters generated by ChatGPT, reiterating existing literature regarding real letters of recommendation in academic surgery.
Our study suggests that surgeons should use AI tools, such as ChatGPT, with caution when writing LORs for academic surgery faculty promotion.",Desai P; Wang H; Davis L; Ullmann TM; DiBrito SR 37701430,The potential of chatbots in chronic venous disease patient management.,2023,JVS-vascular insights,,,,"OBJECTIVE: Health care providers and recipients have been using artificial intelligence and its subfields, such as natural language processing and machine learning technologies, in the form of search engines to obtain medical information for some time now. Although a search engine returns a ranked list of webpages in response to a query and allows the user to obtain information from those links directly, ChatGPT has elevated the interface between humans with artificial intelligence by attempting to provide relevant information in a human-like textual conversation. This technology is being adopted rapidly and has enormous potential to impact various aspects of health care, including patient education, research, scientific writing, pre-visit/post-visit queries, documentation assistance, and more. The objective of this study is to assess whether chatbots could assist with answering patient questions and electronic health record inbox management. METHODS: We devised two questionnaires: (1) administrative and non-complex medical questions (based on actual inbox questions); and (2) complex medical questions on the topic of chronic venous disease. We graded the performance of publicly available chatbots regarding their potential to assist with electronic health record inbox management. The study was graded by an internist and a vascular medicine specialist independently. RESULTS: On administrative and non-complex medical questions, ChatGPT 4.0 performed better than ChatGPT 3.5. ChatGPT 4.0 received a grade of 1 on all the questions: 20 of 20 (100%). 
ChatGPT 3.5 received a grade of 1 on 14 of 20 questions (70%), grade 2 on 4 of 20 questions (20%), grade 3 on 0 questions (0%), and grade 4 on 2 of 20 questions (10%). On complex medical questions, ChatGPT 4.0 performed the best. ChatGPT 4.0 received a grade of 1 on 15 of 20 questions (75%), grade 2 on 2 of 20 questions (10%), grade 3 on 2 of 20 questions (10%), and grade 4 on 1 of 20 questions (5%). ChatGPT 3.5 received a grade of 1 on 9 of 20 questions (45%), grade 2 on 4 of 20 questions (20%), grade 3 on 4 of 20 questions (20%), and grade 4 on 3 of 20 questions (15%). Clinical Camel received a grade of 1 on 0 of 20 questions (0%), grade 2 on 5 of 20 questions (25%), grade 3 on 5 of 20 questions (25%), and grade 4 on 10 of 20 questions (50%). CONCLUSIONS: Based on our interactions with ChatGPT regarding the topic of chronic venous disease, it is plausible that in the future, this technology may be used to assist with electronic health record inbox management and offload medical staff. However, for this technology to receive regulatory approval to be used for that purpose, it will require extensive supervised training by subject experts, have guardrails to prevent ""hallucinations"" and maintain confidentiality, and prove that it can perform at a level comparable to (if not better than) humans. (JVS-Vascular Insights 2023;1:100019.).",Athavale A; Baier J; Ross E; Fukaya E 38462064,Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis.,2024,Journal of biomedical informatics,,,,"OBJECTIVE: Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in answering medical questions and provide direction for future research.
METHODS: An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was ""ChatGPT,"" without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data was extracted on general study characteristics, question sources, conversation processes, assessment metrics, and performance of ChatGPT. An evaluation framework for LLM in medical inquiries was proposed by integrating insights from selected literature. This study is registered with PROSPERO, CRD42023456327. RESULTS: A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall integrated accuracy of 56 % (95 % CI: 51 %-60 %, I(2) = 87 %) in addressing medical queries. However, the studies varied in question resource, question-asking process, and evaluation metrics. As per our proposed evaluation framework, many studies failed to report methodological details, such as the date of inquiry, version of ChatGPT, and inter-rater consistency. CONCLUSION: This review reveals ChatGPT's potential in addressing medical inquiries, but the heterogeneity of the study design and insufficient reporting might affect the results' reliability. Our proposed evaluation framework provides insights for the future study design and transparent reporting of LLM in responding to medical questions.",Wei Q; Yao Z; Cui Y; Wei B; Jin Z; Xu X 38556438,Evaluating The Role of ChatGPT as a Study Aid in Medical Education in Surgery.,2024,Journal of surgical education,,,,"OBJECTIVE: Our aim was to assess how ChatGPT compares to Google search in assisting medical students during their surgery clerkships. 
DESIGN: We conducted a crossover study where participants were asked to complete 2 standardized assessments on different general surgery topics before and after they used either Google search or ChatGPT. SETTING: The study was conducted at the Perelman School of Medicine at the University of Pennsylvania (PSOM) in Philadelphia, Pennsylvania. PARTICIPANTS: 19 third-year medical students participated in our study. RESULTS: The baseline (preintervention) performance of participants on both quizzes did not differ between the Google search and ChatGPT groups (p = 0.728). Students overall performed better postintervention and the difference in test scores was statistically significant for both the Google group (p < 0.001) and the ChatGPT group (p = 0.01). The mean percent increase in test scores pre- and postintervention was higher in the Google group at 11% vs. 10% in the ChatGPT group, but this difference was not statistically significant (p = 0.87). Similarly, there was no statistically significant difference in postintervention scores on both assessments between the 2 groups (p = 0.508). Postassessment surveys revealed that all students (100%) have known about ChatGPT before, and 47% have previously used it for various purposes. On a scale of 1 to 10 with 1 being the lowest and 10 being the highest, the feasibility of ChatGPT and its usefulness in finding answers were rated as 8.4 and 6.6 on average, respectively. When asked to rate the likelihood of using ChatGPT in their surgery rotation, the answers ranged between 1 and 3 (""Unlikely"" 47%), 4 to 6 (""intermediate"" 26%), and 7 to 10 (""likely"" 26%). 
CONCLUSION: Our results show that even though ChatGPT was comparable to Google search in finding answers pertaining to surgery questions, many students were reluctant to use ChatGPT for learning purposes during their surgery clerkship.",Araji T; Brooks AD 38130802,Feasibility and acceptability of ChatGPT generated radiology report summaries for cancer patients.,2023,Digital health,,,,"OBJECTIVE: Patients now have direct access to their radiology reports, which can include complex terminology and be difficult to understand. We assessed ChatGPT's ability to generate summarized MRI reports for patients with prostate cancer and evaluated physician satisfaction with the artificial intelligence (AI)-summarized report. METHODS: We used ChatGPT to summarize five full MRI reports for patients with prostate cancer performed at a single institution from 2021 to 2022. Three summarized reports were generated for each full MRI report. Full MRI and summarized reports were assessed for readability using Flesch-Kincaid Grade Level (FK) score. Radiation oncologists were asked to evaluate the AI-summarized reports via an anonymous questionnaire. Qualitative responses were given on a 1-5 Likert-type scale. Fifty newly diagnosed prostate cancer patient MRIs performed at a single institution were additionally assessed for physician online portal response rates. RESULTS: Fifteen summarized reports were generated from five full MRI reports using ChatGPT. The median FK score for the full MRI reports and summarized reports was 9.6 vs. 5.0 (p < 0.05), respectively. Twelve radiation oncologists responded to our questionnaire. The mean [SD] ratings for summarized reports were factual correctness (4.0 [0.6]), understanding (4.0 [0.7]), completeness (4.1 [0.5]), potential for harm (3.5 [0.9]), overall quality (3.4 [0.9]), and likelihood to send to patient (3.1 [1.1]). Current physician online portal response rates were 14/50 (28%) at our institution.
CONCLUSIONS: We demonstrate a novel application of ChatGPT to summarize MRI reports at a reading level appropriate for patients. Physicians were likely to be satisfied with the summarized reports with respect to factual correctness, ease of understanding, and completeness. Physicians were less likely to be satisfied with respect to potential for harm, overall quality, and likelihood to send to patients. Further research is needed to optimize ChatGPT's ability to summarize radiology reports and understand what factors influence physician trust in AI-summarized reports.",Chung EM; Zhang SC; Nguyen AT; Atkins KM; Sandler HM; Kamrava M 38613821,PMC-LLaMA: toward building open-source language models for medicine.,2024,Journal of the American Medical Informatics Association : JAMIA,,,,"OBJECTIVE: Recently, large language models (LLMs) have showcased remarkable capabilities in natural language understanding. While demonstrating proficiency in everyday conversations and question-answering (QA) situations, these models frequently struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge. In this article, we describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA. MATERIALS AND METHODS: We adapt a general-purpose LLM toward the medical domain, involving data-centric knowledge injection through the integration of 4.8M biomedical academic papers and 30K medical textbooks, as well as comprehensive domain-specific instruction fine-tuning, encompassing medical QA, rationale for reasoning, and conversational dialogues with 202M tokens. RESULTS: While evaluating various public medical QA benchmarks and manual rating, our lightweight PMC-LLaMA, which consists of only 13B parameters, exhibits superior performance, even surpassing ChatGPT. All models, codes, and datasets for instruction tuning will be released to the research community. 
DISCUSSION: Our contributions are 3-fold: (1) we build up an open-source LLM toward the medical domain. We believe the proposed PMC-LLaMA model can promote further development of foundation models in medicine, serving as a medical trainable basic generative language backbone; (2) we conduct thorough ablation studies to demonstrate the effectiveness of each proposed component, demonstrating how different training data and model scales affect medical LLMs; (3) we contribute a large-scale, comprehensive dataset for instruction tuning. CONCLUSION: In this article, we systematically investigate the process of building up an open-source medical-specific LLM, PMC-LLaMA.",Wu C; Lin W; Zhang X; Zhang Y; Xie W; Wang Y 38758667,Utilizing ChatGPT as a scientific reasoning engine to differentiate conflicting evidence and summarize challenges in controversial clinical questions.,2024,Journal of the American Medical Informatics Association : JAMIA,,,,"OBJECTIVE: Synthesizing and evaluating inconsistent medical evidence is essential in evidence-based medicine. This study aimed to employ ChatGPT as a sophisticated scientific reasoning engine to identify conflicting clinical evidence and summarize unresolved questions to inform further research. MATERIALS AND METHODS: We evaluated ChatGPT's effectiveness in identifying conflicting evidence and investigated its principles of logical reasoning. An automated framework was developed to generate a PubMed dataset focused on controversial clinical topics. ChatGPT analyzed this dataset to identify consensus and controversy, and to formulate unsolved research questions. Expert evaluations were conducted 1) on the consensus and controversy for factual consistency, comprehensiveness, and potential harm and, 2) on the research questions for relevance, innovation, clarity, and specificity. RESULTS: The gpt-4-1106-preview model achieved a 90% recall rate in detecting inconsistent claim pairs within a ternary assertions setup. 
Notably, without explicit reasoning prompts, ChatGPT provided sound reasoning for the assertions between claims and hypotheses, based on an analysis grounded in relevance, specificity, and certainty. ChatGPT's conclusions of consensus and controversies in clinical literature were comprehensive and factually consistent. The research questions proposed by ChatGPT received high expert ratings. DISCUSSION: Our experiment implies that, in evaluating the relationship between evidence and claims, ChatGPT considered more detailed information beyond a straightforward assessment of sentimental orientation. This ability to process intricate information and conduct scientific reasoning regarding sentiment is noteworthy, particularly as this pattern emerged without explicit guidance or directives in prompts, highlighting ChatGPT's inherent logical reasoning capabilities. CONCLUSION: This study demonstrated ChatGPT's capacity to evaluate and interpret scientific claims. Such proficiency can be generalized to broader clinical research literature. ChatGPT effectively aids in facilitating clinical studies by proposing unresolved challenges based on analysis of existing studies. However, caution is advised as ChatGPT's outputs are inferences drawn from the input literature and could be harmful to clinical practice.",Xie S; Zhao W; Deng G; He G; He N; Lu Z; Hu W; Zhao M; Du J 39833868,Health profession students' perceptions of ChatGPT in healthcare and education: insights from a mixed-methods study.,2025,BMC medical education,,,,"OBJECTIVE: The aim of this study was to investigate the perceptions of health profession students regarding ChatGPT use and the potential impact of integrating ChatGPT in healthcare and education. BACKGROUND: Artificial Intelligence is increasingly utilized in medical education and clinical profession training. 
However, since its introduction, ChatGPT remains relatively unexplored in terms of health profession students' acceptance of its use in education and practice. DESIGN: This study employed a mixed-methods approach, using a web-based survey. METHODS: The study involved a convenience sample recruited through various methods, including Faculty of Medicine announcements, social media, and snowball sampling, during the second semester (March to June 2023). Data were collected using a structured questionnaire with closed-ended questions and three open-ended questions. The final sample comprised 217 undergraduate health profession students, including 73 (33.6%) nursing students, 65 (30.0%) medical students, and 79 (36.4%) occupational therapy, physiotherapy, and speech therapy students. RESULTS: Among the surveyed students, 86.2% were familiar with ChatGPT, with generally positive perceptions as reflected by a mean score of 4.04 (SD = 0.62) on a scale of 1 to 5. Positive feedback was particularly noted with respect to ChatGPT's role in information retrieval and summarization. The qualitative data revealed three main themes: experiences with ChatGPT, its impact on the quality of healthcare, and its integration into the curriculum. The findings highlight benefits such as serving as a convenient tool for accessing information, reducing human errors, and fostering innovative learning approaches. However, they also underscore areas of concern, including ethical considerations, challenges in fostering critical thinking, and issues related to verification. The absence of significant differences between the different fields of study indicates consistent perceptions across nursing, medicine, and other health profession students. CONCLUSIONS: Our findings underscore the necessity for continuous refinement to enhance ChatGPT's accuracy, reliability, and alignment with the diverse educational needs of health professions. 
These insights not only deepen our understanding of student perceptions of ChatGPT in healthcare education but also have significant implications for the future integration of AI in health profession practice. The study emphasizes the importance of a careful balance between leveraging the benefits of AI tools and addressing ethical and pedagogical concerns.",Moskovich L; Rozani V 38112347,FROM TEXT TO DIAGNOSE: CHATGPT'S EFFICACY IN MEDICAL DECISION-MAKING.,2023,"Wiadomosci lekarskie (Warsaw, Poland : 1960)",,,,"OBJECTIVE: The aim: Evaluate the diagnostic capabilities of the ChatGPT in the field of medical diagnosis. PATIENTS AND METHODS: Materials and methods: We utilized 50 clinical cases, employing Large Language Model ChatGPT-3.5. The experiment had three phases, each with a new chat setup. In the initial phase, ChatGPT received detailed clinical case descriptions, guided by a ""Persona Pattern"" prompt. In the second phase, cases with diagnostic errors were addressed by providing potential diagnoses for ChatGPT to choose from. The final phase assessed artificial intelligence's ability to mimic a medical practitioner's diagnostic process, with prompts limiting initial information to symptoms and history. RESULTS: Results: In the initial phase, ChatGPT showed a 66.00% diagnostic accuracy, surpassing physicians by nearly 50%. Notably, in 11 cases requiring image interpretation, ChatGPT struggled initially but achieved a correct diagnosis for four without added interpretations. In the second phase, ChatGPT demonstrated a remarkable 70.59% diagnostic accuracy, while physicians averaged 41.47%. Furthermore, the overall accuracy of Large Language Model in first and second phases together was 90.00%. In the third phase emulating real doctor decision-making, ChatGPT achieved a 46.00% success rate. CONCLUSION: Conclusions: Our research underscores ChatGPT's strong potential in clinical medicine as a diagnostic tool, especially in structured scenarios. 
It emphasizes the need for supplementary data and the complexity of medical diagnosis. This contributes valuable insights to AI-driven clinical diagnostics, with a nod to the importance of prompt engineering techniques in ChatGPT's interaction with doctors.",Mykhalko Y; Kish P; Rubtsova Y; Kutsyn O; Koval V 38488302,ChatGPT for Automated Cross-Checking of Authors' Conflicts of Interest Against Industry Payments.,2024,Otolaryngology--head and neck surgery : official journal of American Academy of Otolaryngology-Head and Neck Surgery,,,,"OBJECTIVE: The Centers for Medicare & Medicaid Services ""OpenPayments"" database tracks industry payments to US physicians to improve research conflicts of interest (COIs) transparency, but manual cross-checking of articles' authors against this database is labor-intensive. This study aims to assess the potential of large language models (LLMs) like ChatGPT to automate COI data analysis in medical publications. STUDY DESIGN: An observational study analyzing the accuracy of ChatGPT in automating the cross-checking of COI disclosures in medical research articles against the OpenPayments database. SETTING: Publications regarding Food and Drug Administration-approved biologics for chronic rhinosinusitis with nasal polyposis: omalizumab, mepolizumab, and dupilumab. METHODS: First, ChatGPT evaluated author affiliations from PubMed to identify those based in the United States. Second, for author names matching 1 or multiple payment recipients in OpenPayments, ChatGPT undertook a comparative analysis between author affiliation and OpenPayments recipient metadata. Third, ChatGPT scrutinized full article COI statements, producing an intricate matrix of disclosures for each author against each relevant company (Sanofi, Regeneron, Genentech, Novartis, and GlaxoSmithKline). A random subset of responses was manually checked for accuracy. RESULTS: In total, 78 relevant articles and 294 unique US authors were included, leading to 980 LLM queries. 
Manual verification showed accuracies of 100% (200/200; 95% confidence interval [CI]: 98.1%-100%) for country analysis, 97.4% (113/116; 95% CI: 92.7%-99.1%) for matching author affiliations with OpenPayments metadata, and 99.2% (1091/1100; 95% CI: 98.5%-99.6%) for COI statement data extraction. CONCLUSION: LLMs have robust potential to automate author-company-specific COI cross-checking against the OpenPayments database. Our findings pave the way for streamlined, efficient, and accurate COI assessment that could be widely employed across medical research.",Safranek C; Liu C; Richmond R; Boyi T; Rimmer R; Manes RP 39603549,Taxonomy-based prompt engineering to generate synthetic drug-related patient portal messages.,2024,Journal of biomedical informatics,,,,"OBJECTIVE: The objectives of this study were to: (1) create a corpus of synthetic drug-related patient portal messages to address the current lack of publicly available datasets for model development, (2) assess differences in language used and linguistics among the synthetic patient portal messages, and (3) assess the accuracy of patient-reported drug side effects for different racial groups. METHODS: We leveraged a taxonomy for patient- and clinician-generated content to guide prompt engineering for synthetic drug-related patient portal messages. We generated two groups of messages: the first group (200 messages) used a subset of the taxonomy relevant to a broad range of drug-related messages and the second group (250 messages) used a subset of the taxonomy relevant to a narrow range of messages focused on side effects. Prompts also include one of five racial groups. Next, we assessed linguistic characteristics among message parts (subject, beginning, body, ending) across different prompt specifications (urgency, patient portal taxa, race). 
We also assessed the performance and frequency of patient-reported side effects across different racial groups and compared them to data present in a real world data source (SIDER). RESULTS: The study generated 450 synthetic patient portal messages, and we assessed linguistic patterns, accuracy of drug-side effect pairs, and frequency of pairs compared to real world data. Linguistic analysis revealed variations in language usage and politeness, and analysis of positive predictive values identified differences in symptoms reported based on urgency levels and racial groups in the prompt. We also found that low incident SIDER drug-side effect pairs were observed less frequently in our dataset. CONCLUSION: This study demonstrates the potential of synthetic patient portal messages as a valuable resource for healthcare research. After creating a corpus of synthetic drug-related patient portal messages, we identified significant language differences and provided evidence that drug-side effect pairs observed in messages are comparable to what is expected in real world settings.",Wang N; Treewaree S; Zirikly A; Lu YL; Nguyen MH; Agarwal B; Shah J; Stevenson JM; Taylor CO 39254470,Is Artificial Intelligence (AI) currently able to provide evidence-based scientific responses on methods that can improve the outcomes of embryo transfers? No.,2024,JBRA assisted reproduction,,,,"OBJECTIVE: The rapid development of Artificial Intelligence (AI) has raised questions about its potential uses in different sectors of everyday life. Specifically in medicine, the question arose whether chatbots could be used as tools for clinical decision-making or patients' and physicians' education. To answer this question in the context of fertility, we conducted a test to determine whether current AI platforms can provide evidence-based responses regarding methods that can improve the outcomes of embryo transfers. 
METHODS: We asked nine popular chatbots to write a 300-word scientific essay, outlining scientific methods that improve embryo transfer outcomes. We then gathered the responses and extracted the methods suggested by each chatbot. RESULTS: Out of a total of 43 recommendations, which could be grouped into 19 similar categories, only 3/19 (15.8%) were evidence-based practices, those being ""ultrasound-guided embryo transfer"" in 7/9 (77.8%) chatbots, ""single embryo transfer"" in 4/9 (44.4%) and ""use of a soft catheter"" in 2/9 (22.2%), whereas some controversial responses like ""preimplantation genetic testing"" appeared frequently (6/9 chatbots; 66.7%), along with other debatable recommendations like ""endometrial receptivity assay"", ""assisted hatching"" and ""time-lapse incubator"". CONCLUSIONS: Our results suggest that AI is not yet in a position to give evidence-based recommendations in the field of fertility, particularly concerning embryo transfer, since the vast majority of responses consisted of scientifically unsupported recommendations. As such, both patients and physicians should be wary of guiding care based on chatbot recommendations in infertility. Chatbot results might improve with time especially if trained from validated medical databases; however, this will have to be scientifically checked.",Kolokythas A; Dahan MH 38815316,Automating biomedical literature review for rapid drug discovery: Leveraging GPT-4 to expedite pandemic response.,2024,International journal of medical informatics,,,,"OBJECTIVE: The rapid expansion of the biomedical literature challenges traditional review methods, especially during outbreaks of emerging infectious diseases when quick action is critical. Our study aims to explore the potential of ChatGPT to automate the biomedical literature review for rapid drug discovery. 
MATERIALS AND METHODS: We introduce a novel automated pipeline helping to identify drugs for a given virus in response to a potential future global health threat. Our approach can be used to select PubMed articles identifying a drug target for the given virus. We tested our approach on two known pathogens: SARS-CoV-2, where the literature is vast, and Nipah, where the literature is sparse. Specifically, a panel of three experts reviewed a set of PubMed articles and labeled them as either describing a drug target for the given virus or not. The same task was given to the automated pipeline and its performance was based on whether it labeled the articles similarly to the human experts. We applied a number of prompt engineering techniques to improve the performance of ChatGPT. RESULTS: Our best configuration used GPT-4 by OpenAI and achieved an out-of-sample validation performance with accuracy/F1-score/sensitivity/specificity of 92.87%/88.43%/83.38%/97.82% for SARS-CoV-2 and 87.40%/73.90%/74.72%/91.36% for Nipah. CONCLUSION: These results highlight the utility of ChatGPT in drug discovery and development and reveal their potential to enable rapid drug target identification during a pandemic-level health emergency.",Yang J; Walker KC; Bekar-Cesaretli AA; Hao B; Bhadelia N; Joseph-McCarthy D; Paschalidis IC 38135548,Development and Evaluation of Aeyeconsult: A Novel Ophthalmology Chatbot Leveraging Verified Textbook Knowledge and GPT-4.,2024,Journal of surgical education,,,,"OBJECTIVE: There has been much excitement on the use of large language models (LLMs) such as ChatGPT in ophthalmology. However, LLMs are limited in that they are trained on unverified information and do not cite their sources. This paper highlights a new methodology to create a generative AI chatbot to answer eye care related questions which uses only verified ophthalmology textbooks as data and cites its sources. SETTING: Yale School of Medicine Department of Ophthalmology and Visual Science. 
DESIGN/METHODS: Aeyeconsult, an ophthalmology chatbot, was developed using GPT-4 (the LLM used to power the publicly available chatbot ChatGPT-4), LangChain, and Pinecone. Ophthalmology textbooks were processed into embeddings and stored in Pinecone. User queries were similarly converted, compared to stored embeddings, and GPT-4 generated responses. The interface was adapted from public code. Both Aeyeconsult and ChatGPT-4 were tested on the same 260 questions from OphthoQuestions.com, with the first response from Aeyeconsult and ChatGPT-4 recorded as the answer. RESULTS: Aeyeconsult outperformed ChatGPT-4 on the OKAP dataset, with 83.4% correct answers compared to 69.2% (p = 0.0118). Aeyeconsult also had fewer instances of no answer and multiple answers. Both systems performed best in General Medicine, with Aeyeconsult achieving 96.2% accuracy. Aeyeconsult's weakest performance was in Clinical Optics at 68.1%, but it still outperformed ChatGPT-4 in this category (45.5%). CONCLUSION: LLMs may be useful in answering ophthalmology questions but their trustworthiness and accuracy is limited due to training on unverified internet data and lack of source citation. We used a new methodology, using verified ophthalmology textbooks as source material and providing citations, to mitigate these issues, resulting in a chatbot more accurate than ChatGPT-4 in answering OKAPs style questions.",Singer MB; Fu JJ; Chow J; Teng CC 37017291,Implications of large language models such as ChatGPT for dental medicine.,2023,Journal of esthetic and restorative dentistry : official publication of the American Academy of Esthetic Dentistry ... [et al.],,,,"OBJECTIVE: This article provides an overview of the implications of ChatGPT and other large language models (LLMs) for dental medicine. OVERVIEW: ChatGPT, a LLM trained on massive amounts of textual data, is adept at fulfilling various language-related tasks. 
Despite its impressive capabilities, ChatGPT has serious limitations, such as occasionally giving incorrect answers, producing nonsensical content, and presenting misinformation as fact. Dental practitioners, assistants, and hygienists are not likely to be significantly impacted by LLMs. However, LLMs could affect the work of administrative personnel and the provision of dental telemedicine. LLMs offer potential for clinical decision support, text summarization, efficient writing, and multilingual communication. As more people seek health information from LLMs, it is crucial to safeguard against inaccurate, outdated, and biased responses to health-related queries. LLMs pose challenges for patient data confidentiality and cybersecurity that must be tackled. In dental education, LLMs present fewer challenges than in other academic fields. LLMs can enhance academic writing fluency, but acceptable usage boundaries in science need to be established. CONCLUSIONS: While LLMs such as ChatGPT may have various useful applications in dental medicine, they come with risks of malicious use and serious limitations, including the potential for misinformation. CLINICAL SIGNIFICANCE: Along with the potential benefits of using LLMs as an additional tool in dental medicine, it is crucial to carefully consider the limitations and potential risks inherent in such artificial intelligence technologies.",Eggmann F; Weiger R; Zitzmann NU; Blatz MB 37991499,How ChatGPT works: a mini review.,2024,European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery,,,,"OBJECTIVE: This paper offers a mini-review of OpenAI's language model, ChatGPT, detailing its mechanisms, applications in healthcare, and comparisons with other large language models (LLMs). 
METHODS: The underlying technology of ChatGPT is outlined, focusing on its neural network architecture, training process, and the role of key elements such as input embedding, encoder, decoder, attention mechanism, and output projection. The advancements in GPT-4, including its capacity for internet connection and the integration of plugins for enhanced functionality are discussed. RESULTS: ChatGPT can generate creative, coherent, and contextually relevant sentences, making it a valuable tool in healthcare for patient engagement, medical education, and clinical decision support. Yet, like other LLMs, it has limitations, including a lack of common sense knowledge, a propensity for hallucination of facts, a restricted context window, and potential privacy concerns. CONCLUSION: Despite the limitations, LLMs like ChatGPT offer transformative possibilities for healthcare. With ongoing research in model interpretability, common-sense reasoning, and handling of longer context windows, their potential is vast. It is crucial for healthcare professionals to remain informed about these technologies and consider their ethical integration into practice.",Briganti G 40318343,Taking the plunge together: A student-led faculty learning seminar series on artificial intelligence.,2025,Currents in pharmacy teaching & learning,,,,"OBJECTIVE: This pilot study explored the effectiveness of a student-led faculty development series by evaluating two key outcomes: the capacity of students to deliver meaningful professional development sessions to faculty and the impact of these sessions on faculty perceptions of generative artificial intelligence (AI). METHODS: In a flipped classroom model, two pharmacy students and 12 faculty members engaged in a semester-long learning series on AI. Each week, students presented on a selected topic followed by discussions that facilitated self-directed learning, including decision-making and project management. 
Faculty perceptions of AI were evaluated before and after the series using an anonymous survey tool (Technology Acceptance Model Edited to Assess ChatGPT Adoption, TAME-ChatGPT). Respondents created a self-chosen code to link their responses. Additionally, students completed a questionnaire to gauge their reflective thinking after the series. RESULTS: Faculty participation averaged 7 members per session. Twelve faculty completed the pre-survey, while 8 faculty completed the post-survey. Among those who had used ChatGPT (n = 4 pre [33 %], n = 2 post [25 %]), scores for usefulness increased, while concerns about risks decreased. In contrast, faculty who had not used ChatGPT (n = 8 pre [67 %], n = 6 post [75 %]) reported unchanged or improved scores for ease of use and reduced anxiety. Both students responded positively to the reflective thinking questionnaire. CONCLUSION: This pilot study demonstrated that a student-led faculty learning series effectively fostered mutual collaborative learning, benefiting both faculty and students. Pharmacy students, often an underutilized resource, can play a valuable role in faculty development. Colleges of pharmacy may enhance faculty engagement by integrating student-led initiatives into their programs.",Munir F; Abdulbaki E; Saiyad Z; Ipema H 38188855,Performance of ChatGPT incorporated chain-of-thought method in bilingual nuclear medicine physician board examinations.,2024,Digital health,,,,"OBJECTIVE: This research explores the performance of ChatGPT, compared to human doctors, in bilingual, Mandarin Chinese and English, medical specialty exam in Nuclear Medicine in Taiwan. METHODS: The study employed generative pre-trained transformer (GPT-4) and integrated chain-of-thoughts (COT) method to enhance performance by triggering and explaining the thinking process to answer the question in a coherent and logical manner. Questions from the Taiwanese Nuclear Medicine Specialty Exam served as the basis for testing. 
The research analyzed the correctness of AI responses in different sections of the exam and explored the influence of question length and language proportion on accuracy. RESULTS: AI, especially ChatGPT with COT, exhibited exceptional capabilities in theoretical knowledge, clinical medicine, and handling integrated questions, often surpassing, or matching human doctor performance. However, AI struggled with questions related to medical regulations. The analysis of question length showed that questions within the 109-163 words range yielded the highest accuracy. Moreover, an increase in the proportion of English words in questions improved both AI and human accuracy. CONCLUSIONS: This research highlights the potential and challenges of AI in the medical field. ChatGPT demonstrates significant competence in various aspects of medical knowledge. However, areas like medical regulations require improvement. The study also suggests that AI may help in evaluating exam question difficulty and maintaining fairness in examinations. These findings shed light on AI role in the medical field, with potential applications in healthcare education, exam preparation, and multilingual environments. Ongoing AI advancements are expected to further enhance AI utility in the medical domain.",Ting YT; Hsieh TC; Wang YF; Kuo YC; Chen YJ; Chan PK; Kao CH 37274428,Using ChatGPT in Medical Research: Current Status and Future Directions.,2023,Journal of multidisciplinary healthcare,,,,"OBJECTIVE: This review aims to evaluate the current evidence on the use of the Generative Pre-trained Transformer (ChatGPT) in medical research, including but not limited to treatment, diagnosis, or medication provision. METHODS: This review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We searched Google Scholar, Web of Science, PubMed, and Medline to identify studies published between 2022 and 2023 that aimed to utilize ChatGPT in medical research. 
All identified references were stored in EndNote. RESULTS: We initially identified 114 articles, out of which six studies met the inclusion and exclusion criteria for full-text screening. Among the six studies, two focused on drug development (33.33%), two on literature review writing (33.33%), and one each on medical report improvement, provision of medical information, improving research conduct, data analysis, and personalized medicine (16.67% each). CONCLUSION: ChatGPT has the potential to revolutionize medical research in various ways. However, its accuracy, originality, academic integrity, and ethical issues must be thoroughly discussed and improved before its widespread implementation in clinical research and medical practice.",Ruksakulpiwat S; Kumar A; Ajibade A 40234374,Can interactive artificial intelligence be used for patient explanations of nuclear medicine examinations in Japanese?,2025,Annals of nuclear medicine,,,,"OBJECTIVE: This study aimed to evaluate the accuracy and validity of patient explanations about nuclear medicine examinations generated in Japanese using ChatGPT-3.5 and ChatGPT-4. METHODS: ChatGPT was used to generate Japanese language explanations for seven single-photon emission computed tomography examinations (bone scintigraphy, brain perfusion imaging, myocardial perfusion imaging, dopamine transporter scintigraphy [DAT scintigraphy], sentinel lymph node scintigraphy, lung perfusion scintigraphy, and renal function scintigraphy) and (18)F-fluorodeoxyglucose positron emission tomography. Nineteen board-certified nuclear medicine technologists evaluated the accuracy and validity of the responses using a 5-point scale. RESULTS: ChatGPT-4 demonstrated significantly higher accuracy and validity than ChatGPT-3.5, with 77.9% of responses rated as above average or excellent for accuracy, in comparison to 36.3% for ChatGPT-3.5. 
For validity, 73.1% of ChatGPT-4's responses were rated as above average or excellent, in comparison to 19.6% for ChatGPT-3.5. ChatGPT-4 outperformed ChatGPT-3.5 in all examinations, with notable improvements in bone scintigraphy, lung perfusion scintigraphy, and DAT scintigraphy. CONCLUSION: These findings suggest that ChatGPT-4 can be a valuable tool for providing patient explanations of nuclear medicine examinations. However, its application still requires expert supervision, and further research is needed to address potential risks and security concerns.",Matsutomo N; Fukami M; Yamamoto T 38793077,Assessment of the Quality and Readability of Information Provided by ChatGPT in Relation to the Use of Platelet-Rich Plasma Therapy for Osteoarthritis.,2024,Journal of personalized medicine,,,,"Objective: This study aimed to evaluate the quality and readability of information generated by ChatGPT versions 3.5 and 4 concerning platelet-rich plasma (PRP) therapy in the management of knee osteoarthritis (OA), exploring whether large language models (LLMs) could play a significant role in patient education. Design: A total of 23 common patient queries regarding the role of PRP therapy in knee OA management were presented to ChatGPT versions 3.5 and 4. The quality of the responses was assessed using the DISCERN criteria, and readability was evaluated using six established assessment tools. Results: Both ChatGPT versions 3.5 and 4 produced moderate quality information. The quality of information provided by ChatGPT version 4 was significantly better than version 3.5, with mean DISCERN scores of 48.74 and 44.59, respectively. Both models scored highly with respect to response relevance and had a consistent emphasis on the importance of shared decision making. 
However, both versions produced content significantly above the recommended 8th grade reading level for patient education materials (PEMs), with mean reading grade levels (RGLs) of 17.18 for ChatGPT version 3.5 and 16.36 for ChatGPT version 4, indicating a potential barrier to their utility in patient education. Conclusions: While ChatGPT versions 3.5 and 4 both demonstrated the capability to generate information of moderate quality regarding the role of PRP therapy for knee OA, the readability of the content remains a significant barrier to widespread usage, exceeding the recommended reading levels for PEMs. Although ChatGPT version 4 showed improvements in quality and source citation, future iterations must focus on producing more accessible content to serve as a viable resource in patient education. Collaboration between healthcare providers, patient organizations, and AI developers is crucial to ensure the generation of high quality, peer reviewed, and easily understandable information that supports informed healthcare decisions.",Fahy S; Niemann M; Bohm P; Winkler T; Oehme S 39564507,The performance of AI in medical examinations: an exploration of ChatGPT in ultrasound medical education.,2024,Frontiers in medicine,,,,"OBJECTIVE: This study aims to evaluate the accuracy of ChatGPT in the context of China's Intermediate Professional Technical Qualification Examination for Ultrasound Medicine, exploring its potential role in ultrasound medical education. METHODS: A total of 100 questions, comprising 70 single-choice and 30 multiple-choice questions, were selected from the examination's question bank. These questions were categorized into four groups: basic knowledge, relevant clinical knowledge, professional knowledge, and professional practice. ChatGPT versions 3.5 and 4.0 were tested, and accuracy was measured based on the proportion of correct answers for each version. 
RESULTS: ChatGPT 3.5 achieved an accuracy of 35.7% for single-choice and 30.0% for multiple-choice questions, while version 4.0 improved to 61.4 and 50.0%, respectively. Both versions performed better in basic knowledge questions but showed limitations in professional practice-related questions. Version 4.0 demonstrated significant improvements across all categories compared to version 3.5, but it still underperformed when compared to resident doctors in certain areas. CONCLUSION: While ChatGPT did not meet the passing criteria for the Intermediate Professional Technical Qualification Examination in Ultrasound Medicine, its strong performance in basic medical knowledge suggests potential as a supplementary tool in medical education. However, its limitations in addressing professional practice tasks need to be addressed.",Hong DR; Huang CY 40075297,AI-assisted decision-making in mild traumatic brain injury.,2025,BMC emergency medicine,,,,"OBJECTIVE: This study evaluates the potential use of ChatGPT in aiding clinical decision-making for patients with mild traumatic brain injury (TBI) by assessing the quality of responses it generates for clinical care. METHODS: Seventeen mild TBI case scenarios were selected from PubMed Central, and each case was analyzed by GPT-4 (March 21, 2024, version) between April 11 and April 20, 2024. Responses were evaluated by four emergency medicine specialists, who rated the ease of understanding, scientific adequacy, and satisfaction with each response using a 7-point Likert scale. Evaluators were also asked to identify critical errors, defined as mistakes in clinical care or interpretation that could lead to morbidity or mortality. The readability of GPT-4's responses was also assessed using the Flesch Reading Ease and Flesch-Kincaid Grade Level tools. RESULTS: There was no significant difference in the ease of understanding between responses with and without critical errors (p = 0.133). 
However, responses with critical errors significantly reduced satisfaction and scientific adequacy (p < 0.001). GPT-4 responses were significantly more difficult to read than the case descriptions (p < 0.001). CONCLUSION: GPT-4 demonstrates potential utility in clinical decision-making for mild TBI management, offering scientifically appropriate and comprehensible responses. However, critical errors and readability issues limit its immediate implementation in emergency settings without oversight by experienced medical professionals.",Yigit Y; Kaynak MF; Alkahlout B; Ahmed S; Gunay S; Ozbek AE 37545428,Dr. ChatGPT: Utilizing Artificial Intelligence in Surgical Education.,2024,The Cleft palate-craniofacial journal : official publication of the American Cleft Palate-Craniofacial Association,,,,"OBJECTIVE: This study sought to explore the unexamined capabilities of ChatGPT in describing the surgical steps of a specialized operation, the Fisher cleft lip repair. DESIGN: A chat log within ChatGPT was created to generate the procedural steps of a cleft lip repair utilizing the Fisher technique. A board certified craniomaxillofacial (CMF) surgeon then wrote the Fisher repair in his own words blinded to the ChatGPT response. Using both responses, a voluntary survey questionnaire was distributed to residents of plastic and reconstructive surgery (PRS), general surgery (GS), internal medicine (IM), and medical students at our institution in a blinded study. SETTING: Authors collected information from residents (PRS, GS, IM) and medical students at one institution. MAIN OUTCOME MEASURES: Primary outcome measures included understanding, preference, and author identification of the procedural prompts. RESULTS: Results show PRS residents were able to detect more inaccuracies of the ChatGPT response as well as prefer the CMF surgeon's prompt in performing the surgery. 
Residents with less expertise in the procedure not only failed to detect who wrote what procedure, but preferred the ChatGPT response in explaining the concept and chose it to perform the surgery. CONCLUSIONS: In applications to surgical education, ChatGPT was found to be effective in generating easy to understand procedural steps that can be followed by medical personnel of all specialties. However, it does not have expert capabilities to provide the minute detail of measurements and specific anatomy required to perform medical procedures.",Lebhar MS; Velazquez A; Goza S; Hoppe IC 38314324,Large-Scale assessment of ChatGPT's performance in benign and malignant bone tumors imaging report diagnosis and its potential for clinical applications.,2024,Journal of bone oncology,,,,"OBJECTIVE: This study was designed to delve into the complexities involved in the diagnosis of benign and malignant bone tumors and to assess the potential of AI technologies like ChatGPT in improving diagnostic accuracy and efficiency. The study also explores few-shot learning as a method to optimize ChatGPT's performance in specialized medical domains such as benign and malignant bone tumors diagnosis. METHODS: A total of 1366 benign and malignant bone tumors-related imaging reports were collected and diagnosed by 25 experienced physicians. The gold standard of diagnosis was established by combining clinical, imaging and pathological principles. These reports were then input into the ChatGPT model which underwent a few-shot learning method to generate diagnostic results. The diagnostic results of the physicians and the AI model were compared to evaluate the performance of ChatGPT. An experiment was conducted to assess the influence of different radiologists' reporting styles on the model's diagnostic performance. Furthermore, in-depth analysis of misdiagnosed cases was carried out, categorizing diagnostic errors and exploring possible causes. 
RESULTS: The diagnostic results generated by ChatGPT showed an accuracy of 0.73, sensitivity of 0.95, and specificity of 0.58. After few-shot learning, ChatGPT demonstrated significant improvement, achieving an accuracy of 0.87, sensitivity of 0.99, and specificity of 0.73, bringing it much closer to the level of physician diagnostics. In an experiment analyzing the influence of the radiologist's reporting style, the model demonstrated higher sensitivity when interpreting reports written by high-level radiologists. ChatGPT misdiagnosed 56 benign cases as malignant. Among these, 35 benign lesions (fibrous dysplasia and osteofibrous dysplasia) were incorrectly identified as metastatic tumors or osteosarcomas; 8 cases of myositis ossificans were wrongly diagnosed as extraosseous osteosarcoma. Seven cases of giant cell tumor of bone at the end of long bone were misdiagnosed as osteosarcoma by intermediate doctors. Chondroblastoma was misdiagnosed as a malignant tumor in 6 cases (2 as osteosarcoma and 4 as chondrosarcoma). In this study, 23 osteosarcoma cases were misdiagnosed by ChatGPT as osteomyelitis. Chondrosarcoma was misdiagnosed as fibrous dysplasia or aneurysmal bone cyst in 8 cases. Four cases of spinal chordoma were misdiagnosed as spinal tuberculosis. CONCLUSION: Our findings highlight the potential of ChatGPT in the diagnosis of benign and malignant bone tumors, offering advantages like enhanced efficiency and a reduction in missed diagnoses. However, the necessity of collaborative interactions between physicians and ChatGPT in practical settings was underscored. By examining AI's capacity in benign and malignant bone tumor diagnosis, this study lays the groundwork for future AI advancements in medicine.
Additionally, the benefits of few-shot learning in fine-tuning ChatGPT applications in specialized fields were also demonstrated.",Yang F; Yan D; Wang Z 38720222,The Revival of Essay-Type Questions in Medical Education: Harnessing Artificial Intelligence and Machine Learning.,2024,Journal of the College of Physicians and Surgeons--Pakistan : JCPSP,,,,"OBJECTIVE: To analyse and compare the assessment and grading of human-written and machine-written formative essays. STUDY DESIGN: Quasi-experimental, qualitative cross-sectional study. Place and Duration of the Study: Department of Science of Dental Materials, Hamdard College of Medicine & Dentistry, Hamdard University, Karachi, from February to April 2023. METHODOLOGY: Ten short formative essays of final-year dental students were manually assessed and graded. These essays were then graded using ChatGPT version 3.5. The chatbot responses and prompts were recorded and matched with manually graded essays. Qualitative analysis of the chatbot responses was then performed. RESULTS: Four different prompts were given to the artificial intelligence (AI) driven platform of ChatGPT to grade the summative essays. These were the chatbot's initial responses without grading, the chatbot's response to grading against criteria, the chatbot's response to criteria-wise grading, and the chatbot's response to questions for the difference in grading. Based on the results, four innovative ways of using AI and machine learning (ML) have been proposed for medical educators: Automated grading, content analysis, plagiarism detection, and formative assessment. ChatGPT provided a comprehensive report with feedback on writing skills, as opposed to manual grading of essays. CONCLUSION: The chatbot's responses were fascinating and thought-provoking. AI and ML technologies can potentially supplement human grading in the assessment of essays. 
Medical educators need to embrace AI and ML technology to enhance the standards and quality of medical education, particularly when assessing long and short essay-type questions. Further empirical research and evaluation are needed to confirm their effectiveness. KEY WORDS: Machine learning, Artificial intelligence, Essays, ChatGPT, Formative assessment.",Shamim MS; Zaidi SJA; Rehman A 40273883,Artificial intelligence (AI) performance on pharmacy skills laboratory course assignments.,2025,Currents in pharmacy teaching & learning,,,,"OBJECTIVE: To compare pharmacy student scores to scores of artificial intelligence (AI)-generated results of three common platforms on pharmacy skills laboratory assignments. METHODS: Pharmacy skills laboratory course assignments were completed by four fourth-year pharmacy student investigators with three free AI platforms: ChatGPT, Copilot, and Gemini. Assignments evaluated were calculations, patient case vignettes, in-depth patient cases, drug information questions, and a reflection activity. Course coordinators graded the AI-generated submissions. Descriptive statistics were utilized to summarize AI scores and compare averages to recent pharmacy student cohorts. Interrater reliability for the four student investigators completing the assignments was assessed. RESULTS: Fourteen skills laboratory assignments were completed utilizing three different AI platforms (ChatGPT, Copilot, and Gemini) by four fourth-year student investigators (n = 168 AI-generated submissions). Copilot was unable to complete 12; therefore, 156 AI-generated submissions were graded by the faculty course coordinators for accuracy and scored from 0 to 100 %. Pharmacy student cohort scores were higher than the average AI scores for all of the skills laboratory assignments except for two in-depth patient cases completed with ChatGPT. 
CONCLUSION: Pharmacy students on average performed better on most skills laboratory assignments than three commonly used artificial intelligence platforms. Teaching students the strengths and weaknesses of utilizing AI in the classroom is essential.",Do V; Donohoe KL; Peddi AN; Carr E; Kim C; Mele V; Patel D; Crawford AN 37217092,The promise and peril of using a large language model to obtain clinical information: ChatGPT performs strongly as a fertility counseling tool with limitations.,2023,Fertility and sterility,,,,"OBJECTIVE: To compare the responses of the large language model-based ""ChatGPT"" to reputable sources when given fertility-related clinical prompts. DESIGN: The ""Feb 13"" version of ChatGPT by OpenAI was tested against established sources relating to patient-oriented clinical information: 17 ""frequently asked questions (FAQs)"" about infertility on the Centers for Disease Control (CDC) Website, 2 validated fertility knowledge surveys, the Cardiff Fertility Knowledge Scale and the Fertility and Infertility Treatment Knowledge Score, as well as the American Society for Reproductive Medicine committee opinion ""optimizing natural fertility."" SETTING: Academic medical center. PATIENT(S): Online AI Chatbot. INTERVENTION(S): Frequently asked questions, survey questions and rephrased summary statements were entered as prompts in the chatbot over a 1-week period in February 2023. MAIN OUTCOME MEASURE(S): For FAQs from CDC: words/response, sentiment analysis polarity and objectivity, total factual statements, rate of statements that were incorrect, referenced a source, or noted the value of consulting providers. FOR FERTILITY KNOWLEDGE SURVEYS: Percentile according to published population data. FOR COMMITTEE OPINION: Whether response to conclusions rephrased as questions identified missing facts. RESULT(S): When administered the CDC's 17 infertility FAQ's, ChatGPT produced responses of similar length (207.8 ChatGPT vs. 
181.0 CDC words/response), factual content (8.65 factual statements/response vs. 10.41), sentiment polarity (mean 0.11 vs. 0.11 on a scale of -1 (negative) to 1 (positive)), and subjectivity (mean 0.42 vs. 0.35 on a scale of 0 (objective) to 1 (subjective)). In total, 9 (6.12%) of 147 ChatGPT factual statements were categorized as incorrect, and only 1 (0.68%) statement cited a reference. ChatGPT would have been at the 87th percentile of Bunting's 2013 international cohort for the Cardiff Fertility Knowledge Scale and at the 95th percentile on the basis of Kudesia's 2017 cohort for the Fertility and Infertility Treatment Knowledge Score. ChatGPT reproduced the missing facts for all 7 summary statements from ""optimizing natural fertility."" CONCLUSION(S): A February 2023 version of ""ChatGPT"" demonstrates the ability of generative artificial intelligence to produce relevant, meaningful responses to fertility-related clinical queries comparable to established sources. Although performance may improve with medical domain-specific training, limitations such as the inability to reliably cite sources and the unpredictable possibility of fabricated information may limit its clinical use.",Chervenak J; Lieman H; Blanco-Breindel M; Jindal S 38712405,Exploring the application of CHATGPT in plastic surgery: a comprehensive systematic review.,2024,JPMA. The Journal of the Pakistan Medical Association,,,,"OBJECTIVE: To determine the impact of ChatGPT in plastic surgery research and assess the authenticity of such contributions. METHODS: The study conducted a literature search in September 2023 from databases like Pubmed, Google Scholar, SCOPUS, and OVID Medline. The following keywords 'ChatGPT', 'chatbot', 'reconstruction', 'aesthetic' and 'plastic surgery' were used. Of the initial 131 articles retrieved, 32 papers were included.
English-language articles from November 2022 to July 2023 discussing ChatGPT's role in plastic and aesthetic surgery were included, whereas non-English documents, irrelevant content, and non-academic sources were excluded from the study. RESULTS: The manuscripts included in the systematic review had a diverse range, including original research articles, case reports, letters to the editor, and editorials. Among the included studies, there were 9 original research articles, 1 case report, 23 letters to the editor, and 2 editorials. Most publications originated from the United States (18) and Australia (7). Analysis suggested concerns such as inaccuracies, plagiarism, outdated knowledge, and lack of personalized advice. Various authors recommend using ChatGPT as a supplementary tool rather than a replacement for human decision-making in medicine. CONCLUSIONS: ChatGPT shows potential in plastic surgery research, but concerns about inaccuracies and outdated knowledge mean it may provide misleading information; it always requires human input and verification.",Arif F; Safri MK; Shahzad Z; Yasmeen SF; Rahman MF; Shaikh SA 39460888,Performance Assessment of GPT 4.0 on the Japanese Medical Licensing Examination.,2024,Current medical science,,,,"OBJECTIVE: To evaluate the accuracy and parsing ability of GPT 4.0 on Japanese medical practitioner qualification examinations in a multidimensional way and to investigate its response accuracy and comprehensiveness with respect to medical knowledge. METHODS: We evaluated the performance of GPT 4.0 on Japanese Medical Licensing Examination (JMLE) questions (2021-2023). Questions were categorized by difficulty and type, with distinctions between general and clinical parts, as well as between single-choice (MCQ1) and multiple-choice (MCQ2) questions. Difficulty levels were determined on the basis of correct rates provided by the JMLE Preparatory School.
The accuracy and quality of the GPT 4.0 responses were analyzed via improved Global Quality Scale (GQS) scores, considering both the chosen options and the accompanying analysis. Descriptive statistics and Pearson Chi-square tests were used to examine performance across exam years, question difficulty, type, and choice. GPT 4.0 ability was evaluated via the GQS, with comparisons made via the Mann-Whitney U or Kruskal-Wallis test. RESULTS: The correct response rate and parsing ability of GPT 4.0 on the JMLE questions reached the qualification level (80.4%). In terms of the accuracy of the GPT 4.0 responses to the JMLE, we found significant differences in accuracy across both difficulty levels and option types. According to the GQS scores for the GPT 4.0 responses to all the JMLE questions, performance varied according to year and choice type. CONCLUSION: GPT 4.0 performs well in providing basic support in medical education and medical research, but it also needs a large amount of medical-related data to train its model and improve the accuracy of its medical knowledge output. Further integration of ChatGPT with the medical field could open new opportunities for medicine.",Wang HL; Zhou H; Zhang JY; Xie Y; Yang JM; Xue MD; Yan ZN; Li W; Zhang XB; Wu Y; Chen XL; Liu PR; Lu L; Ye ZW 37277096,Appropriateness and Readability of ChatGPT-4-Generated Responses for Surgical Treatment of Retinal Diseases.,2023,Ophthalmology. Retina,,,,"OBJECTIVE: To evaluate the appropriateness and readability of the medical knowledge provided by ChatGPT-4, an artificial intelligence-powered conversational search engine, regarding common vitreoretinal surgeries for retinal detachments (RDs), macular holes (MHs), and epiretinal membranes (ERMs). DESIGN: Retrospective cross-sectional study. SUBJECTS: This study did not involve any human participants.
METHODS: We created lists of common questions about the definition, prevalence, visual impact, diagnostic methods, surgical and nonsurgical treatment options, postoperative information, surgery-related complications, and visual prognosis of RD, MH, and ERM, and asked each question 3 times on the online ChatGPT-4 platform. The data for this cross-sectional study were recorded on April 25, 2023. Two independent retina specialists graded the appropriateness of the responses. Readability was assessed using Readable, an online readability tool. MAIN OUTCOME MEASURES: The ""appropriateness"" and ""readability"" of the answers generated by ChatGPT-4 bot. RESULTS: Responses were consistently appropriate in 84.6% (33/39), 92% (23/25), and 91.7% (22/24) of the questions related to RD, MH, and ERM, respectively. Answers were inappropriate at least once in 5.1% (2/39), 8% (2/25), and 8.3% (2/24) of the respective questions. The average Flesch Kincaid Grade Level and Flesch Reading Ease Score were 14.1 +/- 2.6 and 32.3 +/- 10.8 for RD, 14 +/- 1.3 and 34.4 +/- 7.7 for MH, and 14.8 +/- 1.3 and 28.1 +/- 7.5 for ERM. These scores indicate that the answers are difficult or very difficult to read for the average lay person and college graduation would be required to understand the material. CONCLUSIONS: Most of the answers provided by ChatGPT-4 were consistently appropriate. However, ChatGPT and other natural language models in their current form are not a source of factual information. Improving the credibility and readability of responses, especially in specialized fields, such as medicine, is a critical focus of research. Patients, physicians, and laypersons should be advised of the limitations of these tools for eye- and health-related counseling. 
FINANCIAL DISCLOSURE(S): Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.",Momenaei B; Wakabayashi T; Shahlaee A; Durrani AF; Pandit SA; Wang K; Mansour HA; Abishek RM; Xu D; Sridhar J; Yonekawa Y; Kuriyan AE 37614494,Diagnostic and Management Applications of ChatGPT in Structured Otolaryngology Clinical Scenarios.,2023,OTO open,,,,"OBJECTIVE: To evaluate the clinical applications and limitations of chat generative pretrained transformer (ChatGPT) in otolaryngology. STUDY DESIGN: Cross-sectional survey. SETTING: Tertiary academic center. METHODS: ChatGPT 4.0 was queried for diagnoses and management plans for 20 physician-written clinical vignettes in otolaryngology. Attending physicians were then asked to rate the difficulty of the clinical vignettes and agreement with the differential diagnoses and management plans of ChatGPT responses on a 5-point Likert scale. Summary statistics were calculated. Univariate ordinal regression was then performed between vignette difficulty and quality of the diagnoses and management plans. RESULTS: Eleven attending physicians completed the survey (61% response rate). Overall, vignettes were rated as very easy to neutral difficulty (range of median score: 1.00-4.00; overall median 2.00). There was a high agreement with the differential diagnosis provided by ChatGPT (range of median score: 3.00-5.00; overall median: 5.00). There was also high agreement with treatment plans (range of median score: 3.00-5.00; overall median: 5.00). There was no association between vignette difficulty and agreement with differential diagnosis or treatment. Lower diagnosis scores had greater odds of having lower treatment scores. CONCLUSION: Generative artificial intelligence models like ChatGPT are being rapidly adopted in medicine. Performance with curated, easy-to-moderate difficulty otolaryngology scenarios indicate high agreement with physicians for diagnosis and management. 
However, a decreased quality in diagnosis is associated with decreased quality in management. Further research is necessary on ChatGPT's ability to handle unstructured clinical information.",Qu RW; Qureshi U; Petersen G; Lee SC 37691497,Use of artificial intelligence large language models as a clinical tool in rehabilitation medicine: a comparative test case.,2023,Journal of rehabilitation medicine,,,,"OBJECTIVE: To explore the potential use of artificial intelligence language models in formulating rehabilitation prescriptions and International Classification of Functioning, Disability and Health (ICF) codes. DESIGN: Comparative study based on a single case report compared to standard answers from a textbook. SUBJECTS: A stroke case from a textbook. METHODS: Chat Generative Pre-Trained Transformer-4 (ChatGPT-4) was used to generate comprehensive medical and rehabilitation prescription information and ICF codes pertaining to the stroke case. This information was compared with standard answers from the textbook, and 2 licensed Physical Medicine and Rehabilitation (PMR) clinicians reviewed the artificial intelligence recommendations for further discussion. RESULTS: ChatGPT-4 effectively formulated rehabilitation prescriptions and ICF codes for a typical stroke case, together with a rationale to support its recommendations. This information was generated in seconds. Compared with standard answers, the large language model generated broader and more general prescriptions in terms of medical problems and management plans, rehabilitation problems and management plans, as well as rehabilitation goals. It also demonstrated the ability to propose specified approaches for each rehabilitation therapy. The language model made an error regarding the ICF category for the stroke case, but no mistakes were identified in the ICF codes assigned.
CONCLUSION: This test case suggests that artificial intelligence language models have potential use in facilitating clinical practice and education in the field of rehabilitation medicine.",Zhang L; Tashiro S; Mukaino M; Yamada S 38578616,Can large language models provide secondary reliable opinion on treatment options for dermatological diseases?,2024,Journal of the American Medical Informatics Association : JAMIA,,,,"OBJECTIVE: To investigate the consistency and reliability of medication recommendations provided by ChatGPT for common dermatological conditions, highlighting the potential for ChatGPT to offer second opinions in patient treatment while also delineating possible limitations. MATERIALS AND METHODS: In this mixed-methods study, we used survey questions in April 2023 for drug recommendations generated by ChatGPT with data from secondary databases, that is, Taiwan's National Health Insurance Research Database and a US medical center database, and validated by dermatologists. The methodology included preprocessing queries, executing them multiple times, and evaluating ChatGPT responses against the databases and dermatologists. The ChatGPT-generated responses were analyzed statistically in a disease-drug matrix, considering disease-medication associations (Q-value) and expert evaluation. RESULTS: ChatGPT achieved a high 98.87% dermatologist approval rate for common dermatological medication recommendations. We evaluated its drug suggestions using the Q-value, showing that human expert validation agreement surpassed Q-value cutoff-based agreement. Varying the cutoff values for disease-medication associations, a cutoff of 3 achieved 95.14% accurate prescriptions, 5 yielded 85.42%, and 10 resulted in 72.92%. While ChatGPT offered accurate drug advice, it occasionally included incorrect ATC codes, leading to issues like incorrect drug use and type, nonexistent codes, repeated errors, and incomplete medication codes.
CONCLUSION: ChatGPT provides medication recommendations as a second opinion in dermatology treatment, but its reliability and comprehensiveness need refinement for greater accuracy. In the future, integrating a medical domain-specific knowledge base for training and ongoing optimization will enhance the precision of ChatGPT's results.",Iqbal U; Lee LT; Rahmanti AR; Celi LA; Li YJ 39316174,Accuracy and consistency of ChatGPT-3.5 and - 4 in providing differential diagnoses in oral and maxillofacial diseases: a comparative diagnostic performance analysis.,2024,Clinical oral investigations,,,,"OBJECTIVE: To investigate the performance of ChatGPT in the differential diagnosis of oral and maxillofacial diseases. METHODS: Thirty-seven oral and maxillofacial lesions findings were presented to ChatGPT-3.5 and - 4, 18 dental surgeons trained in oral medicine/pathology (OMP), 23 general dental surgeons (DDS), and 16 dental students (DS) for differential diagnosis. Additionally, a group of 15 general dentists was asked to describe 11 cases to ChatGPT versions. The ChatGPT-3.5, -4, and human primary and alternative diagnoses were rated by 2 independent investigators with a 4 Likert-Scale. The consistency of ChatGPT-3.5 and - 4 was evaluated with regenerated inputs. RESULTS: Moderate consistency of outputs was observed for ChatGPT-3.5 and - 4 to provide primary (kappa = 0.532 and kappa = 0.533 respectively) and alternative (kappa = 0.337 and kappa = 0.367 respectively) hypotheses. The mean of correct diagnoses was 64.86% for ChatGPT-3.5, 80.18% for ChatGPT-4, 86.64% for OMP, 24.32% for DDS, and 16.67% for DS. The mean correct primary hypothesis rates were 45.95% for ChatGPT-3.5, 61.80% for ChatGPT-4, 82.28% for OMP, 22.72% for DDS, and 15.77% for DS. The mean correct diagnosis rate for ChatGPT-3.5 with standard descriptions was 64.86%, compared to 45.95% with participants' descriptions. 
For ChatGPT-4, the mean was 80.18% with standard descriptions and 61.80% with participant descriptions. CONCLUSION: ChatGPT-4 demonstrates an accuracy comparable to specialists to provide differential diagnosis for oral and maxillofacial diseases. Consistency of ChatGPT to provide diagnostic hypotheses for oral diseases cases is moderate, representing a weakness for clinical application. The quality of case documentation and descriptions impacts significantly on the performance of ChatGPT. CLINICAL RELEVANCE: General dentists, dental students and specialists in oral medicine and pathology may benefit from ChatGPT-4 as an auxiliary method to define differential diagnosis for oral and maxillofacial lesions, but its accuracy is dependent on precise case descriptions.",Tomo S; Lechien JR; Bueno HS; Cantieri-Debortoli DF; Simonato LE 39799741,The role of artificial intelligence in gynecologic and obstetric emergencies.,2025,"European journal of obstetrics, gynecology, and reproductive biology",,,,"OBJECTIVE: To investigate the potential of artificial intelligence (AI) in emergency medicine, focusing on its utility in triaging and managing acute gynecologic and obstetric emergencies. METHODS AND MATERIALS: This feasibility study assessed Chat-GPT's performance in triaging and recommending management interventions for gynecologic and obstetric emergencies, using ten fictive cases. Five common conditions were modeled for each specialty. Chat-GPT was tasked with proposing triage classifications and providing immediate management recommendations. Human experts independently reviewed each case, classified triage categories, and proposed management. Following this, experts evaluated Chat-GPT's recommendations, rating the AI's responses on accuracy and clinical applicability. RESULTS: Chat-GPT's recommendations demonstrated high concordance with human evaluators. 
Chat-GPT's triage classifications matched those of human experts in most cases, though minor discrepancies in urgency ratings were observed. The AI's suggestions were mostly rated as ""very good"" to ""excellent."" While Chat-GPT consistently delivered appropriate responses, some human evaluators noted slight differences in perceived urgency. CONCLUSIONS: This study highlights Chat-GPT's potential as a clinical support tool in emergency medicine. Chat-GPT provided structured, evidence-based recommendations comparable to those of experienced clinicians, especially for high-stakes gynecologic and obstetric emergencies. Although encouraging, these results underscore the value of utilizing AI in addition to human knowledge, as variations in urgency ratings and management nuances highlight the necessity of human supervision in crucial decision-making.",Psilopatis I; Heindl F; Cupisti S; Fischer U; Kohlmann V; Schneider M; Bader S; Krueckel A; Emons J 39420246,Exploring the potential of artificial intelligence models for triage in the emergency department.,2024,Postgraduate medicine,,,,"OBJECTIVE: To perform a comparative analysis of the three-level triage protocol conducted by triage nurses and emergency medicine doctors with the use of ChatGPT, Gemini, and Pi, which are recognized artificial intelligence (AI) models widely used in daily life. MATERIALS AND METHODS: The study was prospectively conducted with patients presenting to the emergency department of a tertiary care hospital from 1 April 2024, to 7 April 2024. Among the patients who presented to the emergency department over this period, data pertaining to their primary complaints, arterial blood pressure values, heart rates, peripheral oxygen saturation values measured by pulse oximetry, body temperature values, age, and gender characteristics were analyzed. The triage categories determined by triage nurses, the abovementioned AI chatbots, and emergency medicine doctors were compared.
RESULTS: The study included 500 patients, of whom 23.8% were categorized identically by all triage evaluators. Compared to the triage conducted by emergency medicine doctors, triage nurses overtriaged 6.4% of the patients and undertriaged 3.1% of the yellow-coded patients and 3.4% of the red-coded patients. Of the AI chatbots, ChatGPT exhibited the closest triage approximation to that of emergency medicine doctors; however, its undertriage rates were 26.5% for yellow-coded patients and 42.6% for red-coded patients. CONCLUSION: The undertriage rates observed in AI models were considerably high. Hence, it does not yet seem appropriate to solely rely on the specified AI models for triage purposes in the emergency department.",Tortum F; Kasali K 40290213,"Comparison of performance of artificial intelligence tools in answering emergency medicine question pool: ChatGPT 4.0, Google Gemini and Microsoft Copilot.",2025,Pakistan journal of medical sciences,,,,"OBJECTIVE: Using artificial intelligence tools that work with different software architectures for both clinical and educational purposes in the medical field has been a subject of considerable interest recently. In this study, we compared the answers given by three different artificial intelligence chatbots to the Emergency Medicine question pool obtained from the questions asked in the Turkish National Medical Specialization Exam. We tried to investigate the effects on the answers given by classifying the questions in terms of content and form and examining the question sentences. METHODS: The questions related to emergency medicine of the Medical Specialization Exam questions between 2015-2020 were recorded. The questions were asked to artificial intelligence models, including ChatGPT-4, Gemini, and Copilot. The length of the questions, the question type and the topics of the wrong answers were recorded. 
RESULTS: The most successful chatbot in terms of total score was Microsoft Copilot (7.8% error margin), while the least successful was Google Gemini (22.9% error margin) (p<0.001). Notably, all chatbots had the highest error margins in questions about trauma and surgical approaches and made mistakes in burns and pediatrics. The increase in the error rates in questions containing the root ""probability"" also showed that the question style affected the answers given. CONCLUSIONS: Although chatbots show promising success in determining the correct answer, we think that examinees should not regard chatbots as a primary source for the exam, but rather as a good auxiliary tool to support their learning processes.",Aksoy I; Arslan MK 38487300,Can ChatGPT-3.5 Pass a Medical Exam? A Systematic Review of ChatGPT's Performance in Academic Testing.,2024,Journal of medical education and curricular development,,,,"OBJECTIVE: We, therefore, aim to conduct a systematic review to assess the academic potential of ChatGPT-3.5, along with its strengths and limitations when giving medical exams. METHOD: Following PRISMA guidelines, a systematic search of the literature was performed using electronic databases PUBMED/MEDLINE, Google Scholar, and Cochrane. Articles from their inception until April 4, 2023, were queried. A formal narrative analysis was conducted by systematically arranging similarities and differences between individual findings together. RESULTS: After rigorous screening, 12 articles underwent this review. All the selected papers assessed the academic performance of ChatGPT-3.5. One study compared the performance of ChatGPT-3.5 with the performance of ChatGPT-4 when giving a medical exam. Overall, ChatGPT performed well in 4 tests, averaged in 4 tests, and performed badly in 4 tests. ChatGPT's performance was directly proportional to the level of the questions' difficulty but was unremarkable on whether the questions were binary, descriptive, or MCQ-based.
ChatGPT's explanation, reasoning, memory, and accuracy were remarkably good, whereas it failed to understand image-based questions, and lacked insight and critical thinking. CONCLUSION: ChatGPT-3.5 performed satisfactorily in the exams it took as an examinee. However, there is a need for future related studies to fully explore the potential of ChatGPT in medical education.",Sumbal A; Sumbal R; Amir A 38818395,Global trends and hotspots of ChatGPT in medical research: a bibliometric and visualized study.,2024,Frontiers in medicine,,,,"OBJECTIVE: With the rapid advancement of Chat Generative Pre-Trained Transformer (ChatGPT) in medical research, our study aimed to identify global trends and focal points in this domain. METHOD: All publications on ChatGPT in medical research were retrieved from the Web of Science Core Collection (WoSCC) by Clarivate Analytics from January 1, 2023, to January 31, 2024. The research trends and focal points were visualized and analyzed using VOSviewer and CiteSpace. RESULTS: A total of 1,239 publications were collected and analyzed. The USA contributed the largest number of publications (458, 37.145%) with the highest total citation frequencies (2,461) and the largest H-index. Harvard University contributed the highest number of publications (33) among all full-time institutions. The Cureus Journal of Medical Science published the most ChatGPT-related research (127, 10.30%). Additionally, Wiwanitkit V contributed the majority of publications in this field (20). ""Artificial Intelligence (AI) and Machine Learning (ML),"" ""Education and Training,"" ""Healthcare Applications,"" and ""Data Analysis and Technology"" emerged as the primary clusters of keywords. These areas are predicted to remain hotspots in future research in this field. 
CONCLUSION: Overall, this study signifies the interdisciplinary nature of ChatGPT research in medicine, encompassing AI and ML technologies, education and training initiatives, diverse healthcare applications, and data analysis and technology advancements. These areas are expected to remain at the forefront of future research, driving continued innovation and progress in the field of ChatGPT in medical research.",Liu L; Qu S; Zhao H; Kong L; Xie Z; Jiang Z; Zou P 38073946,Potential Use of ChatGPT for Patient Information in Periodontology: A Descriptive Pilot Study.,2023,Cureus,,,,"Objectives The aim of this study is to evaluate the accuracy and completeness of the answers given by Chat Generative Pre-trained Transformer (ChatGPT) (OpenAI OpCo, LLC, San Francisco, CA), to the most frequently asked questions on different topics in the field of periodontology. Methods The 10 most frequently asked questions by patients about seven different topics (periodontal diseases, peri-implant diseases, tooth sensitivity, gingival recessions, halitosis, dental implants, and periodontal surgery) in periodontology were created by ChatGPT. To obtain responses, a set of 70 questions was submitted to ChatGPT, with an allocation of 10 questions per subject. The responses that were documented were assessed using two distinct Likert scales by professionals specializing in the subject of periodontology. The accuracy of the responses was rated on a Likert scale ranging from one to six, while the completeness of the responses was rated on a scale ranging from one to three. Results The median accuracy score for all responses was six, while the completeness score was two. The mean scores for accuracy and completeness were 5.50 +/- 0.23 and 2.34 +/- 0.24, respectively. 
It was observed that ChatGPT's responses to the questions most frequently asked by patients for information purposes in periodontology were at least ""nearly completely correct"" in terms of accuracy and ""adequate"" in terms of completeness. There was a statistically significant difference between subjects in terms of accuracy and completeness (P<0.05). The highest and lowest accuracy scores were observed for peri-implant diseases and gingival recession, respectively, while the highest and lowest completeness scores were observed for gingival recession and dental implants, respectively. Conclusions The utilization of large language models has become increasingly prevalent, extending their applicability to patients within the healthcare domain. While ChatGPT may not offer absolute precision and comprehensive results without expert supervision, it is apparent that those within the field of periodontology can utilize it as an informational resource, albeit acknowledging the potential for inaccuracies.",Babayigit O; Tastan Eroglu Z; Ozkan Sen D; Ucan Yarkac F 38995551,Assessing ChatGPT's theoretical knowledge and prescriptive accuracy in bacterial infections: a comparative study with infectious diseases residents and specialists.,2024,Infection,,,,"OBJECTIVES: Advancements in Artificial Intelligence (AI) have made platforms like ChatGPT increasingly relevant in medicine. This study assesses ChatGPT's utility in addressing bacterial infection-related questions and antibiogram-based clinical cases. METHODS: This study involved a collaborative effort between infectious disease (ID) specialists and residents. A group of experts formulated six true/false questions, six open-ended questions, and six clinical cases with antibiograms for four types of infections (endocarditis, pneumonia, intra-abdominal infections, and bloodstream infection), for a total of 96 questions. The questions were submitted to four senior residents and four specialists in ID and inputted into ChatGPT-4 and a trained version of ChatGPT-4. 
A total of 720 responses were obtained and reviewed by a blinded panel of experts in antibiotic treatments. They evaluated the responses for accuracy and completeness, the ability to identify correct resistance mechanisms from antibiograms, and the appropriateness of antibiotic prescriptions. RESULTS: No significant difference was noted among the four groups for true/false questions, with approximately 70% correct answers. The trained ChatGPT-4 and ChatGPT-4 offered more accurate and complete answers to the open-ended questions than both the residents and specialists. Regarding the clinical cases, ChatGPT-4 showed lower accuracy in recognizing the correct resistance mechanism. ChatGPT-4 tended not to prescribe newer antibiotics like cefiderocol or imipenem/cilastatin/relebactam, favoring less recommended options like colistin. Both the trained ChatGPT-4 and ChatGPT-4 recommended longer than necessary treatment periods (p-value = 0.022). CONCLUSIONS: This study highlights ChatGPT's capabilities and limitations in medical decision-making, specifically regarding bacterial infections and antibiogram analysis. While ChatGPT demonstrated proficiency in answering theoretical questions, it did not consistently align with expert decisions in clinical case management. Despite these limitations, the potential of ChatGPT as a supportive tool in ID education and preliminary analysis is evident. 
However, it should not replace expert consultation, especially in complex clinical decision-making.",De Vito A; Geremia N; Marino A; Bavaro DF; Caruana G; Meschiari M; Colpani A; Mazzitelli M; Scaglione V; Venanzi Rullo E; Fiore V; Fois M; Campanella E; Pistara E; Faltoni M; Nunnari G; Cattelan A; Mussini C; Bartoletti M; Vaira LA; Madeddu G 40097790,Assessing the performance of an artificial intelligence based chatbot in the differential diagnosis of oral mucosal lesions: clinical validation study.,2025,Clinical oral investigations,,,,"OBJECTIVES: Artificial intelligence (AI) is becoming more popular in medicine. The current study aims to investigate, primarily, if an AI-based chatbot, such as ChatGPT, could be a valid tool for assisting in establishing a differential diagnosis of oral mucosal lesions. METHODS: Data was gathered from patients who were referred to our clinic for an oral mucosal biopsy by one oral medicine specialist. Clinical description, differential diagnoses, and final histopathologic diagnoses were retrospectively extracted from patient records. The lesion description was inputted into ChatGPT version 4.0 under a uniform script to generate three differential diagnoses. ChatGPT and an oral medicine specialist's differential diagnosis were compared to the final histopathologic diagnosis. RESULTS: 100 oral soft tissue lesions were evaluated. A statistically significant correlation was found between the ability of the Chatbot and the Specialist to accurately diagnose the cases (P < 0.001). ChatGPT demonstrated remarkable sensitivity for diagnosing urgent cases, as none of the malignant lesions were missed by the chatbot. At the same time, the specificity of the specialist was higher in cases of malignant lesion diagnosis (p < 0.05). The chatbot performance was reliable in two different events (p < 0.01). 
CONCLUSION: ChatGPT-4 has shown the ability to pinpoint suspicious malignant lesions and suggest an adequate differential diagnosis for soft tissue lesions in a consistent and repeatable manner. CLINICAL RELEVANCE: This study serves as a preliminary insight into the role of AI chatbots as assistive tools in oral medicine and assesses their clinical capabilities.",Grinberg N; Whitefield S; Kleinman S; Ianculovici C; Wasserman G; Peleg O 39588809,Artificial intelligence and pain medicine education: Benefits and pitfalls for the medical trainee.,2025,Pain practice : the official journal of World Institute of Pain,,,,"OBJECTIVES: Artificial intelligence (AI) represents an exciting and evolving technology that is increasingly being utilized across pain medicine. Large language models (LLMs) are one type of AI that has become particularly popular. Currently, there is a paucity of literature analyzing the impact that AI may have on trainee education. As such, we sought to assess the benefits and pitfalls that AI may have on pain medicine trainee education. Given the rapidly increasing popularity of LLMs, we particularly assessed how these LLMs may promote and hinder trainee education through a pilot quality improvement project. MATERIALS AND METHODS: A comprehensive search of the existing literature regarding AI within medicine was performed to identify its potential benefits and pitfalls within pain medicine. The pilot project was approved by the UPMC Quality Improvement Review Committee (#4547). Three of the most commonly utilized LLMs at the initiation of this pilot study - ChatGPT Plus, Google Bard, and Bing AI - were asked a series of multiple choice questions to evaluate their ability to assist in learner education within pain medicine. 
RESULTS: Potential benefits of AI within pain medicine trainee education include ease of use, imaging interpretation, procedural/surgical skills training, learner assessment, personalized learning experiences, ability to summarize vast amounts of knowledge, and preparation for the future of pain medicine. Potential pitfalls include discrepancies between AI devices and associated cost-differences, correlating radiographic findings to clinical significance, interpersonal/communication skills, educational disparities, bias/plagiarism/cheating concerns, lack of incorporation of private domain literature, and absence of training specifically for pain medicine education. Regarding the quality improvement project, ChatGPT Plus answered the highest percentage of all questions correctly (16/17). Lowest correctness scores by LLMs were in answering first-order questions, with Google Bard and Bing AI answering 4/9 and 3/9 first-order questions correctly, respectively. Qualitative evaluation of these LLM-provided explanations in answering second- and third-order questions revealed some reasoning inconsistencies (e.g., providing flawed information in selecting the correct answer). CONCLUSIONS: AI represents a continually evolving and promising modality to assist trainees pursuing a career in pain medicine. Still, limitations currently exist that may hinder their independent use in this setting. Future research exploring how AI may overcome these challenges is thus required. 
Until then, AI should be utilized as a supplementary tool within pain medicine trainee education, and with caution.",Glicksman M; Wang S; Yellapragada S; Robinson C; Orhurhu V; Emerick T 38940388,Artificial intelligence chatbot vs pathology faculty and residents: Real-world clinical questions from a genitourinary treatment planning conference.,2024,American journal of clinical pathology,,,,"OBJECTIVES: Artificial intelligence (AI)-based chatbots have demonstrated accuracy in a variety of fields, including medicine, but research has yet to substantiate their accuracy and clinical relevance. We evaluated an AI chatbot's answers to questions posed during a treatment planning conference. METHODS: Pathology residents, pathology faculty, and an AI chatbot (OpenAI ChatGPT [January 30, 2023, release]) answered a questionnaire curated from a genitourinary subspecialty treatment planning conference. Results were evaluated by 2 blinded adjudicators: a clinician expert and a pathology expert. Scores were based on accuracy and clinical relevance. RESULTS: Overall, faculty scored highest (4.75), followed by the AI chatbot (4.10), research-prepared residents (3.50), and unprepared residents (2.87). The AI chatbot scored statistically significantly better than unprepared residents (P = .03) but not statistically significantly different from research-prepared residents (P = .33) or faculty (P = .30). Residents did not statistically significantly improve after research (P = .39), and faculty performed statistically significantly better than both resident categories (unprepared, P < .01; research prepared, P = .01). CONCLUSIONS: The AI chatbot gave answers to medical questions that were comparable in accuracy and clinical relevance to those of pathology faculty, suggesting promise for further development. 
Serious concerns remain, however, that without the ability to provide support with references, AI will face legitimate scrutiny as to how it can be integrated into medical decision-making.",Luo MX; Lyle A; Bennett P; Albertson D; Sirohi D; Maughan BL; McMurtry V; Mahlow J 39843286,Efficacy and empathy of AI chatbots in answering frequently asked questions on oral oncology.,2025,"Oral surgery, oral medicine, oral pathology and oral radiology",,,,"OBJECTIVES: Artificial intelligence chatbots have demonstrated feasibility and efficacy in improving health outcomes. In this study, responses from 5 different publicly available AI chatbots-Bing, GPT-3.5, GPT-4, Google Bard, and Claude-to frequently asked questions related to oral cancer were evaluated. STUDY DESIGN: Relevant patient-related frequently asked questions about oral cancer were obtained from two main sources: public health websites and social media platforms. From these sources, 20 oral cancer-related questions were selected. Four board-certified specialists in oral medicine/oral and maxillofacial pathology assessed the answers using modified version of the global quality score on a 5-point Likert scale. Additionally, readability was measured using the Flesch-Kincaid Grade Level and Flesch Reading Ease scores. Responses were also assessed for empathy using a validated 5-point scale. RESULTS: Specialists ranked GPT-4 with highest total score of 17.3 +/- 1.5, while Bing received the lowest at 14.9 +/- 2.2. Bard had the highest Flesch Reading Ease score of 62 +/- 7; and ChatGPT-3.5 and Claude received the lowest scores (more challenging readability). GPT-4 and Bard emerged as the most superior chatbots in terms of empathy and accurate citations on patient-related frequently asked questions pertaining to oral cancer. GPT-4 had highest overall quality, whereas Bing showed the lowest level of quality, empathy, and accuracy for citations. 
CONCLUSION: GPT-4 demonstrated the highest quality responses to frequently asked questions pertaining to oral cancer. Although impressive in their ability to guide patients on common oral cancer topics, most chatbots did not perform well when assessed for empathy or citation accuracy.",Rokhshad R; Khoury ZH; Mohammad-Rahimi H; Motie P; Price JB; Tavares T; Jessri M; Bavarian R; Sciubba JJ; Sultan AS 37529789,"Performance of emergency triage prediction of an open access natural language processing based chatbot application (ChatGPT): A preliminary, scenario-based cross-sectional study.",2023,Turkish journal of emergency medicine,,,,"OBJECTIVES: Artificial intelligence companies have been increasing their initiatives recently to improve the results of chatbots, which are software programs that can converse with a human in natural language. The role of chatbots in health care is deemed worthy of research. OpenAI's ChatGPT is a supervised and empowered machine learning-based chatbot. The aim of this study was to determine the performance of ChatGPT in emergency medicine (EM) triage prediction. METHODS: This was a preliminary, cross-sectional study conducted with case scenarios generated by the researchers based on the emergency severity index (ESI) handbook v4 cases. Two independent EM specialists who were experts in the ESI triage scale determined the triage categories for each case. A third independent EM specialist was consulted as arbiter, if necessary. Consensus results for each case scenario were assumed as the reference triage category. Subsequently, each case scenario was queried with ChatGPT and the answer was recorded as the index triage category. Inconsistent classifications between the ChatGPT and reference category were defined as over-triage (false positive) or under-triage (false negative). RESULTS: Fifty case scenarios were assessed in the study. Reliability analysis showed a fair agreement between EM specialists and ChatGPT (Cohen's Kappa: 0.341). 
Eleven cases (22%) were over-triaged and 9 cases (18%) were under-triaged by ChatGPT. In 9 cases (18%), ChatGPT reported two consecutive triage categories, one of which matched the expert consensus. It had an overall sensitivity of 57.1% (95% confidence interval [CI]: 34-78.2), specificity of 34.5% (95% CI: 17.9-54.3), positive predictive value (PPV) of 38.7% (95% CI: 21.8-57.8), negative predictive value (NPV) of 52.6% (95% CI: 28.9-75.6), and an F1 score of 0.461. In high acuity cases (ESI-1 and ESI-2), ChatGPT showed a sensitivity of 76.2% (95% CI: 52.8-91.8), specificity of 93.1% (95% CI: 77.2-99.2), PPV of 88.9% (95% CI: 65.3-98.6), NPV of 84.4% (95% CI: 67.2-94.7), and an F1 score of 0.821. The receiver operating characteristic curve showed an area under the curve of 0.846 (95% CI: 0.724-0.969, P < 0.001) for high acuity cases. CONCLUSION: The performance of ChatGPT was best when predicting high acuity cases (ESI-1 and ESI-2). It may be useful when determining the cases requiring critical care. When trained with more medical knowledge, ChatGPT may be more accurate for other triage category predictions.",Sarbay I; Berikol GB; Ozturan IU 38895543,ChatGPT-3.5 passes Poland's medical final examination-Is it possible for ChatGPT to become a doctor in Poland?,2024,SAGE open medicine,,,,"OBJECTIVES: ChatGPT is an advanced chatbot based on a Large Language Model that has the ability to answer questions. Undoubtedly, ChatGPT is capable of transforming communication, education, and customer support; however, can it play the role of a doctor? In Poland, prior to obtaining a medical diploma, candidates must successfully pass the Medical Final Examination. METHODS: The purpose of this research was to determine how well ChatGPT performed on the Polish Medical Final Examination, which must be passed to become a doctor in Poland (an exam is considered passed if at least 56% of the tasks are answered correctly). 
A total of 2138 categorized Medical Final Examination questions (from 11 examination sessions held between 2013-2015 and 2021-2023) were presented to ChatGPT-3.5 from 19 to 26 May 2023. For further analysis, the questions were divided into quintiles based on difficulty and duration, as well as question types (simple A-type or complex K-type). The answers provided by ChatGPT were compared to the official answer key, reviewed for any changes resulting from the advancement of medical knowledge. RESULTS: ChatGPT correctly answered 53.4%-64.9% of questions. In 8 out of 11 exam sessions, ChatGPT achieved the scores required to successfully pass the examination (60%). The correlation between the efficacy of artificial intelligence and the level of complexity, difficulty, and length of a question was found to be negative. AI outperformed humans in one category: psychiatry (77.18% vs. 70.25%, p = 0.081). CONCLUSIONS: The performance of artificial intelligence is deemed satisfactory; however, it is observed to be markedly inferior to that of human graduates in the majority of instances. Despite its potential utility in many medical areas, ChatGPT is constrained by its inherent limitations that prevent it from entirely supplanting human expertise and knowledge.",Suwala S; Szulc P; Guzowski C; Kaminska B; Dorobiala J; Wojciechowska K; Berska M; Kubicka O; Kosturkiewicz O; Kosztulska B; Rajewska A; Junik R 37083166,Potentials and pitfalls of ChatGPT and natural-language artificial intelligence models for the understanding of laboratory medicine test results. An assessment by the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group on Artificial Intelligence (WG-AI).,2023,Clinical chemistry and laboratory medicine,,,,"OBJECTIVES: ChatGPT, a tool based on natural language processing (NLP), is on everyone's mind, and several potential applications in healthcare have already been proposed. 
However, since the ability of this tool to interpret laboratory test results has not yet been tested, the EFLM Working Group on Artificial Intelligence (WG-AI) has set itself the task of closing this gap with a systematic approach. METHODS: WG-AI members generated 10 simulated laboratory reports of common parameters, which were then passed to ChatGPT for interpretation, according to reference intervals (RI) and units, using an optimized prompt. The results were subsequently evaluated independently by all WG-AI members with respect to relevance, correctness, helpfulness and safety. RESULTS: ChatGPT recognized all laboratory tests, could detect whether they deviated from the RI, and gave a test-by-test as well as an overall interpretation. The interpretations were rather superficial, not always correct, and only in some cases judged coherently. The magnitude of the deviation from the RI seldom played a role in the interpretation of the laboratory tests, and the artificial intelligence (AI) did not make any meaningful suggestions regarding follow-up diagnostics or further procedures in general. CONCLUSIONS: ChatGPT in its current form, not being specifically trained on medical data or laboratory data in particular, may only be considered a tool capable of interpreting a laboratory report on a test-by-test basis at best, but not of interpreting the overall diagnostic picture. Future generations of similar AIs with medical ground-truth training data may well revolutionize current processes in healthcare, although such an implementation is not yet ready.",Cadamuro J; Cabitza F; Debeljak Z; De Bruyne S; Frans G; Perez SM; Ozdemir H; Tolios A; Carobene A; Padoan A 37992892,Assessing the applicability and appropriateness of ChatGPT in answering clinical pharmacy questions.,2024,Annales pharmaceutiques francaises,,,,"OBJECTIVES: Clinical pharmacists rely on different scientific references to ensure appropriate, safe, and cost-effective drug use. 
Tools based on artificial intelligence (AI), such as ChatGPT (Generative Pre-trained Transformer), could offer valuable support. The objective of this study was to assess ChatGPT's capacity to correctly respond to clinical pharmacy questions asked by healthcare professionals in our university hospital. MATERIAL AND METHODS: ChatGPT's capacity to respond correctly to the last 100 consecutive questions recorded in our clinical pharmacy database was assessed. Questions were copied from our FileMaker Pro database and pasted into the ChatGPT March 14 version online platform. The generated answers were then copied verbatim into an Excel file. Two blinded clinical pharmacists reviewed all the questions and the answers given by the software. In case of disagreement, a third blinded pharmacist intervened to decide. RESULTS: Documentation-related issues (n=36) and drug administration mode (n=30) were the most frequently recorded question types. Among 69 applicable questions, the rate of correct answers varied from 30 to 57.1% depending on question type, with an overall rate of 44.9%. Regarding inappropriate answers (n=38), 20 were incorrect, 18 gave no answer and 8 were incomplete, with 8 answers belonging to 2 different categories. None of ChatGPT's answers were better than the pharmacists'. CONCLUSIONS: ChatGPT demonstrated a mixed performance in answering clinical pharmacy questions. It should not replace human expertise, as a high rate of inappropriate answers was highlighted. 
Future studies should focus on the optimization of ChatGPT for specific clinical pharmacy questions and explore the potential benefits and limitations of integrating this technology into clinical practice.",Fournier A; Fallet C; Sadeghipour F; Perrottet N 39661726,Medical language matters: impact of clinical summary composition on a generative artificial intelligence's diagnostic accuracy.,2025,"Diagnosis (Berlin, Germany)",,,,"OBJECTIVES: Evaluate the impact of problem representation (PR) characteristics on Generative Artificial Intelligence (GAI) diagnostic accuracy. METHODS: Internal medicine attendings and residents from two academic medical centers were given a clinical vignette and instructed to write a PR. Deductive content analysis described the characteristics comprising each PR. Individual PRs were input into ChatGPT-4 (OpenAI, September 2023) which was prompted to generate a ranked three-item differential. The ranked differential and the top-ranked diagnosis were scored on a 3-part scale, ranging from incorrect, partially correct, to correct. Logistic regression evaluated individual PR characteristic's impact on ChatGPT accuracy. RESULTS: For a three-item differential, accuracy was associated with including fewer comorbidities (OR 0.57, p=0.010), fewer past historical items (OR 0.60, p=0.019), and more physical examination items (OR 1.66, p=0.015). For ChatGPT's ability to rank the true diagnosis as the single-best diagnosis, utilizing temporal semantic qualifiers, more semantic qualifiers overall, and adhering to a typical 3-part PR format all correlated with diagnostic accuracy: OR 3.447, p=0.046; OR 1.300, p=0.005; OR 3.577, p=0.020, respectively. CONCLUSIONS: Several distinct PR factors improved ChatGPT diagnostic accuracy. These factors have previously been associated with expertise in creating PR. 
Future studies should explore how clinical input qualities affect GAI diagnostic accuracy prospectively.",Skittle C; Bonifacino E; McQuade CN 38804035,"Comparison of ChatGPT, Gemini, and Le Chat with physician interpretations of medical laboratory questions from an online health forum.",2024,Clinical chemistry and laboratory medicine,,,,"OBJECTIVES: Laboratory medical reports are often not intuitively comprehensible to non-medical professionals. Given their recent advancements, easier accessibility and remarkable performance on medical licensing exams, patients are therefore likely to turn to artificial intelligence-based chatbots to understand their laboratory results. However, empirical studies assessing the efficacy of these chatbots in responding to real-life patient queries regarding laboratory medicine are scarce. METHODS: Thus, this investigation included 100 patient inquiries from an online health forum, specifically addressing Complete Blood Count interpretation. The aim was to evaluate the proficiency of three artificial intelligence-based chatbots (ChatGPT, Gemini and Le Chat) against the online responses from certified physicians. RESULTS: The findings revealed that the chatbots' interpretations of laboratory results were inferior to those from online medical professionals. While the chatbots exhibited a higher degree of empathetic communication, they frequently produced erroneous or overly generalized responses to complex patient questions. The appropriateness of chatbot responses ranged from 51 to 64 %, with 22 to 33 % of responses overestimating patient conditions. A notable positive aspect was the chatbots' consistent inclusion of disclaimers regarding its non-medical nature and recommendations to seek professional medical advice. CONCLUSIONS: The chatbots' interpretations of laboratory results from real patient queries highlight a dangerous dichotomy - a perceived trustworthiness potentially obscuring factual inaccuracies. 
Given the growing inclination towards self-diagnosis using AI platforms, further research and improvement of these chatbots is imperative to increase patients' awareness and avoid future burdens on the healthcare system.",Meyer A; Soleman A; Riese J; Streichert T 40113208,Comparing large language models for antibiotic prescribing in different clinical scenarios: which performs better?,2025,Clinical microbiology and infection : the official publication of the European Society of Clinical Microbiology and Infectious Diseases,,,,"OBJECTIVES: Large language models (LLMs) show promise in clinical decision-making, but comparative evaluations of their antibiotic prescribing accuracy are limited. This study assesses the performance of various LLMs in recommending antibiotic treatments across diverse clinical scenarios. METHODS: Fourteen LLMs, including standard and premium versions of ChatGPT, Claude, Copilot, Gemini, Le Chat, Grok, Perplexity, and Pi.ai, were evaluated using 60 clinical cases with antibiograms covering 10 infection types. A standardized prompt was used for antibiotic recommendations focusing on drug choice, dosage, and treatment duration. Responses were anonymized and reviewed by a blinded expert panel assessing antibiotic appropriateness, dosage correctness, and duration adequacy. RESULTS: A total of 840 responses were collected and analysed. ChatGPT-o1 demonstrated the highest accuracy in antibiotic prescriptions, with 71.7% (43/60) of its recommendations classified as correct and only one (1.7%) incorrect. Gemini and Claude 3 Opus had the lowest accuracy. Dosage correctness was highest for ChatGPT-o1 (96.7%, 58/60), followed by Claude 3.5 Sonnet (91.7%, 55/60) and Perplexity Pro (90.0%, 54/60). In treatment duration, Gemini provided the most appropriate recommendations (75.0%, 45/60), whereas Claude 3.5 Sonnet tended to over-prescribe duration. Performance declined with increasing case complexity, particularly for difficult-to-treat microorganisms. 
DISCUSSION: There is significant variability among LLMs in prescribing appropriate antibiotics, dosages, and treatment durations. ChatGPT-o1 outperformed other models, indicating the potential of advanced LLMs as decision-support tools in antibiotic prescribing. However, decreased accuracy in complex cases and inconsistencies among models highlight the need for careful validation before clinical utilization.",De Vito A; Geremia N; Bavaro DF; Seo SK; Laracy J; Mazzitelli M; Marino A; Maraolo AE; Russo A; Colpani A; Bartoletti M; Cattelan AM; Mussini C; Parisi SG; Vaira LA; Nunnari G; Madeddu G 39393839,Investigating the capabilities of advanced large language models in generating patient instructions and patient educational material.,2024,European journal of hospital pharmacy : science and practice,,,,"OBJECTIVES: Large language models (LLMs) with advanced language generation capabilities have the potential to enhance patient interactions. This study evaluates the effectiveness of ChatGPT 4.0 and Gemini 1.0 Pro in providing patient instructions and creating patient educational material (PEM). METHODS: A cross-sectional study employed ChatGPT 4.0 and Gemini 1.0 Pro across six medical scenarios using simple and detailed prompts. The Patient Education Materials Assessment Tool for Print materials (PEMAT-P) evaluated the understandability, actionability, and readability of the outputs. RESULTS: LLMs provided consistent responses, especially regarding drug information, therapeutic goals, administration, common side effects, and interactions. However, they lacked guidance on expiration dates and proper medication disposal. Detailed prompts yielded comprehensible outputs for the average adult. ChatGPT 4.0 had mean understandability and actionability scores of 80% and 60%, respectively, compared with 67% and 60% for Gemini 1.0 Pro. ChatGPT 4.0 produced longer outputs, achieving 85% readability with detailed prompts, while Gemini 1.0 Pro maintained consistent readability. 
Simple prompts resulted in ChatGPT 4.0 outputs at a 10th-grade reading level, while Gemini 1.0 Pro outputs were at a 7th-grade level. Both LLMs produced outputs at a 6th-grade level with detailed prompts. CONCLUSION: LLMs show promise in generating patient instructions and PEM. However, healthcare professional oversight and patient education on LLM use are essential for effective implementation.",Sridharan K; Sivaramakrishnan G 38046089,Brain versus bot: Distinguishing letters of recommendation authored by humans compared with artificial intelligence.,2023,AEM education and training,,,,"OBJECTIVES: Letters of recommendation (LORs) are essential within academic medicine, affecting a number of important decisions regarding advancement, yet these letters take significant amounts of time and labor to prepare. The use of generative artificial intelligence (AI) tools, such as ChatGPT, are gaining popularity for a variety of academic writing tasks and offer an innovative solution to relieve the burden of letter writing. It is yet to be determined if ChatGPT could aid in crafting LORs, particularly in high-stakes contexts like faculty promotion. To determine the feasibility of this process and whether there is a significant difference between AI and human-authored letters, we conducted a study aimed at determining whether academic physicians can distinguish between the two. METHODS: A quasi-experimental study was conducted using a single-blind design. Academic physicians with experience in reviewing LORs were presented with LORs for promotion to associate professor, written by either humans or AI. Participants reviewed LORs and identified the authorship. Statistical analysis was performed to determine accuracy in distinguishing between human and AI-authored LORs. Additionally, the perceived quality and persuasiveness of the LORs were compared based on suspected and actual authorship. RESULTS: A total of 32 participants completed letter review. 
The mean accuracy of distinguishing between human- versus AI-authored LORs was 59.4%. The reviewer's certainty and time spent deliberating did not significantly impact accuracy. LORs suspected to be human-authored were rated more favorably in terms of quality and persuasiveness. A difference in gender-biased language was observed in our letters: human-authored letters contained significantly more female-associated words, while the majority of AI-authored letters tended to use more male-associated words. CONCLUSIONS: Participants were unable to reliably differentiate between human- and AI-authored LORs for promotion. AI may be able to generate LORs and relieve the burden of letter writing for academicians. New strategies, policies, and guidelines are needed to balance the benefits of AI while preserving integrity and fairness in academic promotion decisions.",Preiksaitis C; Nash C; Gottlieb M; Chan TM; Alvarez A; Landry A 38857454,RefAI: a GPT-powered retrieval-augmented generative tool for biomedical literature recommendation and summarization.,2024,Journal of the American Medical Informatics Association : JAMIA,,,,"OBJECTIVES: Precise literature recommendation and summarization are crucial for biomedical professionals. While the latest iteration of generative pretrained transformer (GPT) incorporates 2 distinct modes-real-time search and pretrained model utilization-it encounters challenges in dealing with these tasks. Specifically, the real-time search can pinpoint some relevant articles but occasionally provides fabricated papers, whereas the pretrained model excels in generating well-structured summaries but struggles to cite specific sources. In response, this study introduces RefAI, an innovative retrieval-augmented generative tool designed to synergize the strengths of large language models (LLMs) while overcoming their limitations. 
MATERIALS AND METHODS: RefAI utilized PubMed for systematic literature retrieval, employed a novel multivariable algorithm for article recommendation, and leveraged GPT-4 turbo for summarization. Ten queries under 2 prevalent topics (""cancer immunotherapy and target therapy"" and ""LLMs in medicine"") were chosen as use cases and 3 established counterparts (ChatGPT-4, ScholarAI, and Gemini) as our baselines. The evaluation was conducted by 10 domain experts through standard statistical analyses for performance comparison. RESULTS: The overall performance of RefAI surpassed that of the baselines across 5 evaluated dimensions-relevance and quality for literature recommendation, accuracy, comprehensiveness, and reference integration for summarization, with the majority exhibiting statistically significant improvements (P-values <.05). DISCUSSION: RefAI demonstrated substantial improvements in literature recommendation and summarization over existing tools, addressing issues like fabricated papers, metadata inaccuracies, restricted recommendations, and poor reference integration. CONCLUSION: By augmenting LLM with external resources and a novel ranking algorithm, RefAI is uniquely capable of recommending high-quality literature and generating well-structured summaries, holding the potential to meet the critical needs of biomedical professionals in navigating and synthesizing vast amounts of scientific literature.",Li Y; Zhao J; Li M; Dang Y; Yu E; Li J; Sun Z; Hussein U; Wen J; Abdelhameed AM; Mai J; Li S; Yu Y; Hu X; Yang D; Feng J; Li Z; He J; Tao W; Duan T; Lou Y; Li F; Tao C 37945282,Heart-to-heart with ChatGPT: the impact of patients consulting AI for cardiovascular health advice.,2023,Open heart,,,,"OBJECTIVES: The advent of conversational artificial intelligence (AI) systems employing large language models such as ChatGPT has sparked public, professional and academic debates on the capabilities of such technologies. 
This mixed-methods study sets out to review and systematically explore the capabilities of ChatGPT to adequately provide health advice to patients when prompted regarding four topics from the field of cardiovascular diseases. METHODS: As of 30 May 2023, 528 items on PubMed contained the term ChatGPT in their title and/or abstract, with 258 being classified as journal articles and included in our thematic state-of-the-art review. For the experimental part, we systematically developed and assessed 123 prompts across the four topics based on three classes of users and two languages. Medical and communications experts scored ChatGPT's responses according to the 4Cs of language model evaluation proposed in this article: correct, concise, comprehensive and comprehensible. RESULTS: The articles reviewed were fairly evenly distributed across discussing how ChatGPT could be used for medical publishing, in clinical practice and for education of medical personnel and/or patients. Quantitatively and qualitatively assessing the capability of ChatGPT on the 123 prompts demonstrated that, while the responses generally received above-average scores, they occupy a spectrum from the concise and correct via the absurd to what only can be described as hazardously incorrect and incomplete. Prompts formulated at higher levels of health literacy generally yielded higher-quality answers. Counterintuitively, responses in a lower-resource language were often of higher quality. CONCLUSIONS: The results emphasise the relationship between prompt and response quality and hint at potentially concerning futures in personalised medicine. 
The widespread use of large language models for health advice might amplify existing health inequalities and will increase the pressure on healthcare systems by providing easy access to many seemingly likely differential diagnoses and recommendations for seeing a doctor for even harmless ailments.",Lautrup AD; Hyrup T; Schneider-Kamp A; Dahl M; Lindholt JS; Schneider-Kamp P 39045939,Comparative performance of artificial intelligence models in physical medicine and rehabilitation board-level questions.,2024,Revista da Associacao Medica Brasileira (1992),,,,"OBJECTIVES: The aim of this study was to compare the performance of artificial intelligence models ChatGPT-3.5, ChatGPT-4, and Google Bard in answering Physical Medicine and Rehabilitation board-style questions, assessing their capabilities in medical education and potential clinical applications. METHODS: A comparative cross-sectional study was conducted using the PMR100, an example question set for the American Board of Physical Medicine and Rehabilitation Part I exam, focusing on artificial intelligence models' ability to answer and categorize questions by difficulty. The study evaluated the artificial intelligence models and analyzed them for accuracy, reliability, and alignment with difficulty levels determined by physiatrists. RESULTS: ChatGPT-4 led with a 74% success rate, followed by Bard at 66%, and ChatGPT-3.5 at 63.8%. Bard showed remarkable answer consistency, altering responses in only 1% of cases. The difficulty assessment by ChatGPT models closely matched that of physiatrists. The study highlighted nuanced differences in artificial intelligence models' performance across various Physical Medicine and Rehabilitation subfields. CONCLUSION: The study illustrates the potential of artificial intelligence in medical education and clinical settings, with ChatGPT-4 showing a slight edge in performance.
It emphasizes the importance of artificial intelligence as a supportive tool for physiatrists, despite the need for careful oversight of artificial intelligence-generated responses to ensure patient safety.",Menekseoglu AK; Is EE 38584026,Evaluating the Potential of AI Chatbots in Treatment Decision-making for Acquired Bilateral Vocal Fold Paralysis in Adults.,2024,Journal of voice : official journal of the Voice Foundation,,,,"OBJECTIVES: The development of artificial intelligence-powered language models, such as Chatbot Generative Pre-trained Transformer (ChatGPT) or Large Language Model Meta AI (Llama), is emerging in medicine. Patients and practitioners have full access to chatbots that may provide medical information. The aim of this study was to explore the performance and accuracy of ChatGPT and Llama in treatment decision-making for bilateral vocal fold paralysis (BVFP). METHODS: Data of 20 clinical cases, treated between 2018 and 2023, were retrospectively collected from four tertiary laryngology centers in Europe. The cases were defined as the most common or most challenging scenarios regarding BVFP treatment. The treatment proposals were discussed in their local multidisciplinary teams (MDT). Each case was presented to ChatGPT-4.0 and Llama Chat-2.0, and potential treatment strategies were requested. The Artificial Intelligence Performance Instrument (AIPI) treatment subscore was used to compare both Chatbots' performances to MDT treatment proposal. RESULTS: Most common etiology of BVFP was thyroid surgery. A form of partial arytenoidectomy with or without posterior transverse cordotomy was the MDT proposal for most cases. The accuracy of both Chatbots was very low regarding their treatment proposals, with a maximum AIPI treatment score in 5% of the cases. In most cases even harmful assertions were made, including the suggestion of vocal fold medialisation to treat patients with stridor and dyspnea. 
ChatGPT-4.0 performed significantly better in suggesting the correct treatment as part of the treatment proposal (50%) compared to Llama Chat-2.0 (15%). CONCLUSION: ChatGPT and Llama are judged as inaccurate in proposing correct treatment for BVFP. ChatGPT significantly outperformed Llama. Treatment decision-making for a complex condition such as BVFP is clearly beyond the Chatbot's knowledge expertise. This study highlights the complexity and heterogeneity of BVFP treatment, and the need for further guidelines dedicated to the management of BVFP.",Dronkers EAC; Geneid A; Al Yaghchi C; Lechien JR 40271313,Precision Oncology in Non-small Cell Lung Cancer: A Comparative Study of Contextualized ChatGPT Models.,2025,Cureus,,,,"OBJECTIVES: The growing adoption of Large Language Models (LLMs) in medicine has raised important questions about their potential utility for clinical decision support within oncology. This study aimed to evaluate the effects of various contextualization methods on ChatGPT's ability to provide National Comprehensive Cancer Network (NCCN) guideline-aligned recommendations on managing non-small cell lung cancer (NSCLC). METHODOLOGY: GPT-4o, base GPT-4, and GPT-4 models contextualized with prompts and PDF documents were asked to identify preferred chemotherapies for twelve advanced lung cancers given molecular profiles derived from the 2024 NCCN Clinical Practice Guidelines in Oncology for NSCLC. GPT responses were subsequently compared to NCCN guidelines using readability scores and qualitative reviewer assessments of (1) recommendation of specific targeted therapy, (2) agreement with NCCN-guideline-preferred therapies, (3) recommendation of guideline non-concordant therapies, and (4) provision of supplementary information. RESULTS: The PDF+Prompt contextualized model demonstrated elevated agreement scores of 23/24 versus 17/24 for GPT-4 (P = 0.040) and 18/24 for GPT-4o (P = 0.089). 
No PDF+Prompt model responses contained guideline non-concordant therapies in contrast to 4/12 responses for GPT4 (P = 0.093) and 5/12 responses for GPT4o (P = 0.037). Comparison of response readability between the PDF+Prompt model and GPT-4 or GPT-4o showed a lower mean word count (both P < 0.001), Simple Measure of Gobbledygook (SMOG) score (both P < 0.001), and Gunning Fog readability score (P < 0.001 for GPT-4, P = 0.002 for GPT-4o). Prompting alone did not significantly improve agreement or reduce the rate of non-concordant therapy recommendations. CONCLUSIONS: The performance gains observed following contextualization suggest that broader applications of LLMs in oncology may exist than current literature indicates. This study provides proof of concept for the use of contextualized GPT models in oncology and showcases their accessibility. Future studies validating this application within additional cancer types or real-life patient encounters could provide an important bridge to eventual adoption.",Brown EDL; Shah HA; Donnelly BM; Ward M; Vojnic M; D'Amico RS 38830507,Enhancing Diagnostic Support for Chiari Malformation and Syringomyelia: A Comparative Study of Contextualized ChatGPT Models.,2024,World neurosurgery,,,,"OBJECTIVES: The rapidly increasing adoption of large language models in medicine has drawn attention to potential applications within the field of neurosurgery. This study evaluates the effects of various contextualization methods on ChatGPT's ability to provide expert-consensus aligned recommendations on the diagnosis and management of Chiari Malformation and Syringomyelia. METHODS: Native GPT4 and GPT4 models contextualized using various strategies were asked questions revised from the 2022 Chiari and Syringomyelia Consortium International Consensus Document. 
ChatGPT-provided responses were then compared to consensus statements using reviewer assessments of 1) responding to the prompt, 2) agreement of ChatGPT response with consensus statements, 3) recommendation to consult with a medical professional, and 4) presence of supplementary information. Flesch-Kincaid, SMOG, word count, and Gunning-Fog readability scores were calculated for each model using the quanteda package in R. RESULTS: Relative to GPT4, all contextualized GPTs demonstrated increased agreement with consensus statements. PDF+Prompting and Prompting models provided the most elevated agreement scores of 19 of 24 and 23 of 24, respectively, versus 9 of 24 for GPT4 (p=.021, p=.001). A trend toward improved readability was observed when comparing contextualized models at large to ChatGPT4, with significant decreases in average word count (180.7 vs 382.3, p<.001) and Flesch-Kincaid Reading Ease score (11.7 vs 17.2, p=.033). CONCLUSIONS: The enhanced performance observed in response to ChatGPT4 contextualization suggests broader applications of large language models in neurosurgery than what the current literature indicates. This study provides proof of concept for the use of contextualized GPT models in neurosurgical contexts and showcases the easy accessibility of improved model performance.",Brown EDL; Ward M; Maity A; Mittler MA; Larry Lo SF; D'Amico RS 37494894,Diagnostic and Management Performance of ChatGPT in Obstetrics and Gynecology.,2023,Gynecologic and obstetric investigation,,,,"OBJECTIVES: The use of artificial intelligence (AI) in clinical patient management and medical education has been advancing over time. ChatGPT was developed and trained recently, using a large quantity of textual data from the internet. Medical science is expected to be transformed by its use. The present study was conducted to evaluate the diagnostic and management performance of the ChatGPT AI model in obstetrics and gynecology. 
DESIGN: A cross-sectional study was conducted. PARTICIPANTS/MATERIALS, SETTING, METHODS: This study was conducted in Iran in March 2023. Medical histories and examination results of 30 cases were determined in six areas of obstetrics and gynecology. The cases were presented to a gynecologist and ChatGPT for diagnosis and management. Answers from the gynecologist and ChatGPT were compared, and the diagnostic and management performance of ChatGPT were determined. RESULTS: Ninety percent (27 of 30) of the cases in obstetrics and gynecology were correctly handled by ChatGPT. Its responses were eloquent, informed, and free of a significant number of errors or misinformation. Even when the answers provided by ChatGPT were incorrect, the responses contained a logical explanation about the case as well as information provided in the question stem. LIMITATIONS: The data used in this study were taken from the electronic book and may reflect bias in the diagnosis of ChatGPT. CONCLUSIONS: This is the first evaluation of ChatGPT's performance in diagnosis and management in the field of obstetrics and gynecology. It appears that ChatGPT has potential applications in the practice of medicine and is (currently) free and simple to use. However, several ethical considerations and limitations such as bias, validity, copyright infringement, and plagiarism need to be addressed in future studies.",Allahqoli L; Ghiasvand MM; Mazidimoradi A; Salehiniya H; Alkatout I 39372551,"Based on Medicine, The Now and Future of Large Language Models.",2024,Cellular and molecular bioengineering,,,,"OBJECTIVES: This review explores the potential applications of large language models (LLMs) such as ChatGPT, GPT-3.5, and GPT-4 in the medical field, aiming to encourage their prudent use, provide professional support, and develop accessible medical AI tools that adhere to healthcare standards. 
METHODS: This paper examines the impact of technologies such as OpenAI's Generative Pre-trained Transformers (GPT) series, including GPT-3.5 and GPT-4, and other large language models (LLMs) in medical education, scientific research, clinical practice, and nursing. Specifically, it includes supporting curriculum design, acting as personalized learning assistants, creating standardized simulated patient scenarios in education; assisting with writing papers, data analysis, and optimizing experimental designs in scientific research; aiding in medical imaging analysis, decision-making, patient education, and communication in clinical practice; and reducing repetitive tasks, promoting personalized care and self-care, providing psychological support, and enhancing management efficiency in nursing. RESULTS: LLMs, including ChatGPT, have demonstrated significant potential and effectiveness in the aforementioned areas, yet their deployment in healthcare settings is fraught with ethical complexities, potential lack of empathy, and risks of biased responses. CONCLUSION: Despite these challenges, significant medical advancements can be expected through the proper use of LLMs and appropriate policy guidance. Future research should focus on overcoming these barriers to ensure the effective and ethical application of LLMs in the medical field.",Su Z; Tang G; Huang R; Qiao Y; Zhang Z; Dai X 39610079,Application value of generative artificial intelligence in the field of stomatology.,2024,Hua xi kou qiang yi xue za zhi = Huaxi kouqiang yixue zazhi = West China journal of stomatology,,,,"OBJECTIVES: This study aims to compare and analyze three types of generative artificial intelligence (GAI) and explore their application value and existing problems in the field of stomatology in the Chinese context. METHODS: A total of 36 questions were designed, covering all the professional areas of stomatology. 
The questions encompassed various aspects including medical records, professional knowledge, and translation and editing. These questions were submitted to ChatGPT4-turbo, Gemini (2024.2) and ERNIE Bot 4.0. After obtaining the answers, a blinded evaluation was conducted by three experienced oral medicine physicians using a four-point Likert scale. The value of GAI in various application scenarios was evaluated. RESULTS: Gemini scored 45, ERNIE Bot scored 38, and ChatGPT scored 33 for clinical documentation and image production. For research assistance, Gemini achieved 45, ERNIE Bot had 39, and ChatGPT scored 35. Teaching assistance capabilities were rated at 54 for ERNIE Bot, 50 for Gemini, and 48 for ChatGPT. In patient consultation and guidance, Gemini scored 78, ERNIE Bot scored 59, and ChatGPT scored 48. Overall, the total scores were 218, 190, and 164 for Gemini, ERNIE Bot, and ChatGPT, respectively. Among GAI applications, the top scoring categories were article translation and polishing (26), patient-doctor communication documentation (23), and popular science content creation (23). The lowest scoring categories were literature search and reporting (13) and image generation (12). CONCLUSIONS: In the Chinese context, the application value of GAI is the highest for Gemini, followed by ERNIE Bot and ChatGPT. GAI shows significant value in translation, patient-doctor communication, and popular science writing. However, its value in literature search, reporting, and image generation remains limited.",Ye Y; Zeng W; Chen J; Liu L 39325705,Is ChatGPT 3.5 smarter than Otolaryngology trainees? A comparison study of board style exam questions.,2024,PloS one,,,,"OBJECTIVES: This study compares the performance of the artificial intelligence (AI) platform Chat Generative Pre-Trained Transformer (ChatGPT) to Otolaryngology trainees on board-style exam questions. 
METHODS: We administered a set of 30 Otolaryngology board-style questions to medical students (MS) and Otolaryngology residents (OR). 31 MSs and 17 ORs completed the questionnaire. The same test was administered to ChatGPT version 3.5, five times. Comparisons of performance were achieved using a one-way ANOVA with Tukey Post Hoc test, along with a regression analysis to explore the relationship between education level and performance. RESULTS: The average scores increased each year from MS1 to PGY5. A one-way ANOVA revealed that ChatGPT outperformed trainee years MS1, MS2, and MS3 (p = <0.001, 0.003, and 0.019, respectively). PGY4 and PGY5 otolaryngology residents outperformed ChatGPT (p = 0.033 and 0.002, respectively). For years MS4, PGY1, PGY2, and PGY3 there was no statistical difference between trainee scores and ChatGPT (p = .104, .996, and 1.000, respectively). CONCLUSION: ChatGPT can outperform lower-level medical trainees on Otolaryngology board-style exam but still lacks the ability to outperform higher-level trainees. These questions primarily test rote memorization of medical facts; in contrast, the art of practicing medicine is predicated on the synthesis of complex presentations of disease and multilayered application of knowledge of the healing process. Given that upper-level trainees outperform ChatGPT, it is unlikely that ChatGPT, in its current form will provide significant clinical utility over an Otolaryngologist.",Patel J; Robinson P; Illing E; Anthony B 38481520,Automated HEART score determination via ChatGPT: Honing a framework for iterative prompt development.,2024,Journal of the American College of Emergency Physicians open,,,,"OBJECTIVES: This study presents a design framework to enhance the accuracy by which large language models (LLMs), like ChatGPT can extract insights from clinical notes. 
We highlight this framework via prompt refinement for the automated determination of HEART (History, ECG, Age, Risk factors, Troponin risk algorithm) scores in chest pain evaluation. METHODS: We developed a pipeline for LLM prompt testing, employing stochastic repeat testing and quantifying response errors relative to physician assessment. We evaluated the pipeline for automated HEART score determination across a limited set of 24 synthetic clinical notes representing four simulated patients. To assess whether iterative prompt design could improve the LLMs' ability to extract complex clinical concepts and apply rule-based logic to translate them to HEART subscores, we monitored diagnostic performance during prompt iteration. RESULTS: Validation included three iterative rounds of prompt improvement for three HEART subscores with 25 repeat trials totaling 1200 queries each for GPT-3.5 and GPT-4. For both LLM models, from initial to final prompt design, there was a decrease in the rate of responses with erroneous, non-numerical subscore answers. Accuracy of numerical responses for HEART subscores (discrete 0-2 point scale) improved for GPT-4 from the initial to final prompt iteration, decreasing from a mean error of 0.16-0.10 (95% confidence interval: 0.07-0.14) points. CONCLUSION: We established a framework for iterative prompt design in the clinical space. 
Although the results indicate potential for integrating LLMs in structured clinical note analysis, translation to real, large-scale clinical data with appropriate data privacy safeguards is needed.",Safranek CW; Huang T; Wright DS; Wright CX; Socrates V; Sangal RB; Iscoe M; Chartash D; Taylor RA 38034065,Assessment of Artificial Intelligence Performance on the Otolaryngology Residency In-Service Exam.,2023,OTO open,,,,"OBJECTIVES: This study seeks to determine the potential use and reliability of a large language learning model for answering questions in a sub-specialized area of medicine, specifically practice exam questions in otolaryngology-head and neck surgery and assess its current efficacy for surgical trainees and learners. STUDY DESIGN AND SETTING: All available questions from a public, paid-access question bank were manually input through ChatGPT. METHODS: Outputs from ChatGPT were compared against the benchmark of the answers and explanations from the question bank. Questions were assessed in 2 domains: accuracy and comprehensiveness of explanations. RESULTS: Overall, our study demonstrates a ChatGPT correct answer rate of 53% and a correct explanation rate of 54%. We find that with increasing difficulty of questions there is a decreasing rate of answer and explanation accuracy. CONCLUSION: Currently, artificial intelligence-driven learning platforms are not robust enough to be reliable medical education resources to assist learners in sub-specialty specific patient decision making scenarios.",Mahajan AP; Shabet CL; Smith J; Rudy SF; Kupfer RA; Bohm LA 39658118,Assessing the accuracy and efficiency of Chat GPT-4 Omni (GPT-4o) in biomedical statistics: Comparative study with traditional tools.,2024,Saudi medical journal,,,,"OBJECTIVES: To assess the accuracy of ChatGPT-4 Omni (GPT-4o) in biomedical statistics. 
The recent novel inauguration of Artificial Intelligence ChatGPT-Omni (GPT-4o), has emerged with the potential to analyze sophisticated and extensive data sets, challenging the expertise of statisticians using traditional statistical tools for data analysis. METHODS: This study was performed in the Department of Physiology, College of Medicine, King Saud University, Riyadh, Saudi Arabia, in May 2024. Three datasets in a raw Excel file format were imported onto Statistical Package for the Social Sciences (SPSS) version 29 for data analysis. Based on this analysis, a script of 9 questions was prepared to command GPT-4 Omni, which was used for data analysis for all 3 datasets on Omni. The score and the time were recorded for each result and verified after being compared to the original analysis results performed on SPSS. RESULTS: GPT-4 Omni scored 73 (85.88%) out of 85 points for all 3 datasets. All datasets took a total of 38.43 minutes to be fully analyzed. Individually, Omni scored 21/25 (84%) for the small dataset in 487.4 seconds, 20/25 (80%) for the middle dataset in 747.02 seconds and 32/35 (91.42%) for the large dataset in 1071 seconds. GPT-4 Omni produced accurate graphs and charts. CONCLUSION: ChatGPT-4 Omni scored better over 80% in all 3 statistical datasets in a short period. GPT-4 Omni also produced accurate graphs and charts as commanded however it required explicit commands with clear instructions to avoid errors and omission of results to achieve appropriate results in biomedical data analysis.",Meo AS; Shaikh N; Meo SA 37794249,ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports.,2024,European radiology,,,,"OBJECTIVES: To assess the quality of simplified radiology reports generated with the large language model (LLM) ChatGPT and to discuss challenges and chances of ChatGPT-like LLMs for medical text simplification. 
METHODS: In this exploratory case study, a radiologist created three fictitious radiology reports which we simplified by prompting ChatGPT with ""Explain this medical report to a child using simple language."" In a questionnaire, we tasked 15 radiologists to rate the quality of the simplified radiology reports with respect to their factual correctness, completeness, and potential harm for patients. We used Likert scale analysis and inductive free-text categorization to assess the quality of the simplified reports. RESULTS: Most radiologists agreed that the simplified reports were factually correct, complete, and not potentially harmful to the patient. Nevertheless, instances of incorrect statements, missed relevant medical information, and potentially harmful passages were reported. CONCLUSION: While we see a need for further adaption to the medical field, the initial insights of this study indicate a tremendous potential in using LLMs like ChatGPT to improve patient-centered care in radiology and other medical domains. CLINICAL RELEVANCE STATEMENT: Patients have started to use ChatGPT to simplify and explain their medical reports, which is expected to affect patient-doctor interaction. This phenomenon raises several opportunities and challenges for clinical routine. KEY POINTS: * Patients have started to use ChatGPT to simplify their medical reports, but their quality was unknown. * In a questionnaire, most participating radiologists overall asserted good quality to radiology reports simplified with ChatGPT. However, they also highlighted a notable presence of errors, potentially leading patients to draw harmful conclusions. * Large language models such as ChatGPT have vast potential to enhance patient-centered care in radiology and other medical domains. 
To realize this potential while minimizing harm, they need supervision by medical experts and adaption to the medical field.",Jeblick K; Schachtner B; Dexl J; Mittermeier A; Stuber AT; Topalis J; Weber T; Wesp P; Sabel BO; Ricke J; Ingrisch M 37698703,Validity and reliability of an instrument evaluating the performance of intelligent chatbot: the Artificial Intelligence Performance Instrument (AIPI).,2024,European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery,,,,"OBJECTIVES: To evaluate the reliability and validity of the Artificial Intelligence Performance Instrument (AIPI). METHODS: Medical records of patients consulting in otolaryngology were evaluated by physicians and ChatGPT for differential diagnosis, management, and treatment. The ChatGPT performance was rated twice using AIPI within a 7-day period to assess test-retest reliability. Internal consistency was evaluated using Cronbach's alpha. Internal validity was evaluated by comparing the AIPI scores of the clinical cases rated by ChatGPT and 2 blinded practitioners. Convergent validity was measured by comparing the AIPI score with a modified version of the Ottawa Clinical Assessment Tool (OCAT). Interrater reliability was assessed using Kendall's tau. RESULTS: Forty-five patients completed the evaluations (28 females). The AIPI Cronbach's alpha analysis suggested an adequate internal consistency (alpha = 0.754). The test-retest reliability was moderate-to-strong for items and the total score of AIPI (r(s) = 0.486, p = 0.001). The mean AIPI score of the senior otolaryngologist was significantly higher compared to the score of ChatGPT, supporting adequate internal validity (p = 0.001). Convergent validity reported a moderate and significant correlation between AIPI and modified OCAT (r(s) = 0.319; p = 0.044). 
The interrater reliability reported significant positive concordance between both otolaryngologists for the patient feature, diagnostic, additional examination, and treatment subscores as well as for the AIPI total score. CONCLUSIONS: AIPI is a valid and reliable instrument in assessing the performance of ChatGPT in ear, nose and throat conditions. Future studies are needed to investigate the usefulness of AIPI in medicine and surgery, and to evaluate the psychometric properties in these fields.",Lechien JR; Maniaci A; Gengler I; Hans S; Chiesa-Estomba CM; Vaira LA 37263772,Performance and risks of ChatGPT used in drug information: an exploratory real-world analysis.,2024,European journal of hospital pharmacy : science and practice,,,,"OBJECTIVES: To investigate the performance and risk associated with the usage of Chat Generative Pre-trained Transformer (ChatGPT) to answer drug-related questions. METHODS: A sample of 50 drug-related questions were consecutively collected and entered in the artificial intelligence software application ChatGPT. Answers were documented and rated in a standardised consensus process by six senior hospital pharmacists in the domains content (correct, incomplete, false), patient management (possible, insufficient, not possible) and risk (no risk, low risk, high risk). As reference, answers were researched in adherence to the German guideline of drug information and stratified in four categories according to the sources used. In addition, the reproducibility of ChatGPT's answers was analysed by entering three questions at different timepoints repeatedly (day 1, day 2, week 2, week 3). RESULTS: Overall, only 13 of 50 answers provided correct content and had enough information to initiate management with no risk of patient harm. The majority of answers were either false (38%, n=19) or had partly correct content (36%, n=18) and no references were provided. 
A high risk of patient harm was likely in 26% (n=13) of the cases and risk was judged low for 28% (n=14) of the cases. In all high-risk cases, actions could have been initiated based on the provided information. The answers of ChatGPT varied over time when entered repeatedly and only three out of 12 answers were identical, showing no reproducibility to low reproducibility. CONCLUSION: In a real-world sample of 50 drug-related questions, ChatGPT answered the majority of questions wrong or partly wrong. The use of artificial intelligence applications in drug information is not possible as long as barriers like wrong content, missing references and reproducibility remain.",Morath B; Chiriac U; Jaszkowski E; Deiss C; Nurnberg H; Horth K; Hoppe-Tichy T; Green K 40046775,"Capturing pharmacists' perspectives on the value, risks, and applications of ChatGPT in pharmacy practice: A qualitative study.",2024,Exploratory research in clinical and social pharmacy,,,,"OBJECTIVES: To investigate the pharmacists' perspectives on benefits and risks of using ChatGPT in pharmacy practice and explore how this disruptive and ground-breaking technology can be effectively integrated. METHODS AND MATERIALS: A qualitative approach that draws data from licensed pharmacists using semi-structured interviews. RESULTS: Most participants felt ChatGPT could enhance the compliance, use, management, safety, adherence to medication, medication counseling, minimize medication errors, and streamline medication dispensing. However, when Chat-GPT has limited information and relies on obsolete medication databases, it risks providing inaccurate recommendations and inadequate medication details. Also, most participants highlighted the difficulty in interpreting ambiguous patient input or drug descriptions when using the application. 
CONCLUSIONS: Despite its potential, utilizing ChatGPT in pharmacy practice must be dependent on evidence-based results that offer profound insight into AI technology.",Jairoun AA; Al-Hemyari SS; Shahwan M; Alnuaimi GR; Ibrahim N; Jaber AAS 39523628,Use of generative artificial intelligence (AI) in psychiatry and mental health care: a systematic review.,2024,Acta neuropsychiatrica,,,,"OBJECTIVES: Tools based on generative artificial intelligence (AI) such as ChatGPT have the potential to transform modern society, including the field of medicine. Due to the prominent role of language in psychiatry, e.g., for diagnostic assessment and psychotherapy, these tools may be particularly useful within this medical field. Therefore, the aim of this study was to systematically review the literature on generative AI applications in psychiatry and mental health. METHODS: We conducted a systematic review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. The search was conducted across three databases, and the resulting articles were screened independently by two researchers. The content, themes, and findings of the articles were qualitatively assessed. RESULTS: The search and screening process resulted in the inclusion of 40 studies. The median year of publication was 2023. The themes covered in the articles were mainly mental health and well-being in general - with less emphasis on specific mental disorders (substance use disorder being the most prevalent). The majority of studies were conducted as prompt experiments, with the remaining studies comprising surveys, pilot studies, and case reports. Most studies focused on models that generate language, ChatGPT in particular. CONCLUSIONS: Generative AI in psychiatry and mental health is a nascent but quickly expanding field. 
The literature mainly focuses on applications of ChatGPT, and finds that generative AI performs well, but notes that it is limited by significant safety and ethical concerns. Future research should strive to enhance transparency of methods, use experimental designs, ensure clinical relevance, and involve users/patients in the design phase.",Kolding S; Lundin RM; Hansen L; Ostergaard SD 39816417,Evaluation of ChatGPT-4 Performance in Answering Patients' Questions About the Management of Type 2 Diabetes.,2024,Sisli Etfal Hastanesi tip bulteni,,,,"OBJECTIVES: Type 2 diabetes mellitus is a disease with a rising prevalence worldwide. Person-centered treatment factors, including comorbidities and treatment goals, should be considered in determining the pharmacological treatment of type 2 diabetes. ChatGPT-4 (Generative Pre-trained Transformer), a large language model, holds the potential performance in various fields, including medicine. We aimed to examine the reliability, quality, reproducibility, and readability of ChatGPT-4's responses to clinical scenarios about the medical treatment approach and management of type 2 diabetes patients. METHODS: ChatGPT-4's responses to 24 questions were independently graded by two endocrinologists with clinical experience in endocrinology and resolved by a third reviewer based on the ADA(American Diabetes Association) 2023 guidelines. DISCERN (Quality Criteria for Consumer Health Information) Measurement Tool was used to evaluate the reliability and quality of information. RESULTS: Responses to questions by ChatGPT-4 were fairly consistent in both sessions. No false or misleading information was found in any ChatGPT-4 responses. In terms of reliability, most of the answers showed good (87.5%), followed by excellent (12.5%) reliability. Reading Level was classified as fairly difficult to read (8.3%), difficult to read (50%), and very difficult to read (41.7%). 
CONCLUSION: ChatGPT-4 may have a role as an additional informative tool for type 2 diabetes patients for medical treatment approaches.",Gokbulut P; Kuskonmaz SM; Onder CE; Taskaldiran I; Koc G 38964828,Assessment of the information provided by ChatGPT regarding exercise for patients with type 2 diabetes: a pilot study.,2024,BMJ health & care informatics,,,,"OBJECTIVES: We assessed the feasibility of ChatGPT for patients with type 2 diabetes seeking information about exercise. METHODS: In this pilot study, two physicians with expertise in diabetes care and rehabilitative treatment in Republic of Korea discussed and determined the 14 most asked questions on exercise for managing type 2 diabetes by patients in clinical practice. Each question was inputted into ChatGPT (V.4.0), and the answers from ChatGPT were assessed. The Likert scale was calculated for each category of validity (1-4), safety (1-4) and utility (1-4) based on position statements of the American Diabetes Association and American College of Sports Medicine. RESULTS: Regarding validity, 4 of 14 ChatGPT (28.6%) responses were scored as 3, indicating accurate but incomplete information. The other 10 responses (71.4%) were scored as 4, indicating complete accuracy with complete information. Safety and utility scored 4 (no danger and completely useful) for all 14 ChatGPT responses. CONCLUSION: ChatGPT can be used as supplementary educational material for diabetic exercise. However, users should be aware that ChatGPT may provide incomplete answers to some questions on exercise for type 2 diabetes.",Chung SM; Chang MC 39583920,ChatGPT in medical writing: A game-changer or a gimmick?,2024,Perspectives in clinical research,,,,"OpenAI's ChatGPT (Generative Pre-trained Transformer) is a chatbot that answers questions and performs writing tasks in a conversational tone. 
Within months of release, multiple sectors are contemplating the varied applications of this chatbot, including medicine, education, and research, all of which are involved in medical communication and scientific publishing. Medical writers and academics use several artificial intelligence (AI) tools and software for research, literature survey, data analyses, referencing, and writing. There are benefits of using different AI tools in medical writing. However, using chatbots for medical communications poses some major concerns such as potential inaccuracies, data bias, security, and ethical issues. Perceived incorrect notions also limit their use. Moreover, ChatGPT can also be challenging if used incorrectly and for irrelevant tasks. If used appropriately, ChatGPT will not only upgrade the knowledge of the medical writer but also save time and energy that could be directed toward more creative and analytical areas requiring expert skill sets. This review introduces chatbots, outlines the progress in ChatGPT research, elaborates the potential uses of ChatGPT in medical communications along with its challenges and limitations, and proposes future research perspectives. It aims to provide guidance for doctors, researchers, and medical writers on the uses of ChatGPT in medical communications.",Ahaley SS; Pandey A; Juneja SK; Gupta TS; Vijayakumar S 38611686,Artificial Intelligence in Medical Imaging: Analyzing the Performance of ChatGPT and Microsoft Bing in Scoliosis Detection and Cobb Angle Assessment.,2024,"Diagnostics (Basel, Switzerland)",,,,"Open-source artificial intelligence models (OSAIM) find free applications in various industries, including information technology and medicine. Their clinical potential, especially in supporting diagnosis and therapy, is the subject of increasingly intensive research. 
Due to the growing interest in artificial intelligence (AI) for diagnostic purposes, we conducted a study evaluating the capabilities of AI models, including ChatGPT and Microsoft Bing, in the diagnosis of single-curve scoliosis based on posturographic radiological images. Two independent neurosurgeons assessed the degree of spinal deformation, selecting 23 cases of severe single-curve scoliosis. Each posturographic image was separately implemented onto each of the mentioned platforms using a set of formulated questions, starting from 'What do you see in the image?' and ending with a request to determine the Cobb angle. In the responses, we focused on how these AI models identify and interpret spinal deformations and how accurately they recognize the direction and type of scoliosis as well as vertebral rotation. The Intraclass Correlation Coefficient (ICC) with a 'two-way' model was used to assess the consistency of Cobb angle measurements, and its confidence intervals were determined using the F test. Differences in Cobb angle measurements between human assessments and the AI ChatGPT model were analyzed using metrics such as RMSEA, MSE, MPE, MAE, RMSLE, and MAPE, allowing for a comprehensive assessment of AI model performance from various statistical perspectives. The ChatGPT model achieved 100% effectiveness in detecting scoliosis in X-ray images, while the Bing model did not detect any scoliosis. However, ChatGPT had limited effectiveness (43.5%) in assessing Cobb angles, showing significant inaccuracy and discrepancy compared to human assessments. This model also had limited accuracy in determining the direction of spinal curvature, classifying the type of scoliosis, and detecting vertebral rotation. Overall, although ChatGPT demonstrated potential in detecting scoliosis, its abilities in assessing Cobb angles and other parameters were limited and inconsistent with expert assessments. 
These results underscore the need for comprehensive improvement of AI algorithms, including broader training with diverse X-ray images and advanced image processing techniques, before they can be considered as auxiliary in diagnosing scoliosis by specialists.",Fabijan A; Zawadzka-Fabijan A; Fabijan R; Zakrzewski K; Nowoslawska E; Polis B 38138922,Artificial Intelligence in Scoliosis Classification: An Investigation of Language-Based Models.,2023,Journal of personalized medicine,,,,"Open-source artificial intelligence models are finding free application in various industries, including computer science and medicine. Their clinical potential, especially in assisting diagnosis and therapy, is the subject of increasingly intensive research. Due to the growing interest in AI for diagnostics, we conducted a study evaluating the abilities of AI models, including ChatGPT, Microsoft Bing, and Scholar AI, in classifying single-curve scoliosis based on radiological descriptions. Fifty-six posturographic images depicting single-curve scoliosis were selected and assessed by two independent neurosurgery specialists, who classified them as mild, moderate, or severe based on Cobb angles. Subsequently, descriptions were developed that accurately characterized the degree of spinal deformation, based on the measured values of Cobb angles. These descriptions were then provided to AI language models to assess their proficiency in diagnosing spinal pathologies. The artificial intelligence models conducted classification using the provided data. Our study also focused on identifying specific sources of information and criteria applied in their decision-making algorithms, aiming for a deeper understanding of the determinants influencing AI decision processes in scoliosis classification. 
The classification quality of the predictions was evaluated using performance evaluation metrics such as sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and balanced accuracy. Our study strongly supported our hypothesis, showing that among four AI models, ChatGPT 4 and Scholar AI Premium excelled in classifying single-curve scoliosis with perfect sensitivity and specificity. These models demonstrated unmatched rater concordance and excellent performance metrics. In comparing real and AI-generated scoliosis classifications, they showed impeccable precision in all posturographic images, indicating total accuracy (1.0, MAE = 0.0) and remarkable inter-rater agreement, with a perfect Fleiss' Kappa score. This was consistent across scoliosis cases with a Cobb's angle range of 11-92 degrees. Despite high accuracy in classification, each model used an incorrect angular range for the mild stage of scoliosis. Our findings highlight the immense potential of AI in analyzing medical data sets. However, the diversity in competencies of AI models indicates the need for their further development to more effectively meet specific needs in clinical practice.",Fabijan A; Polis B; Fabijan R; Zakrzewski K; Nowoslawska E; Zawadzka-Fabijan A 37830256,Beyond ChatGPT: What does GPT-4 add to healthcare? The dawn of a new era.,2023,Cardiology journal,,,,"Over the past few years, artificial intelligence (AI) has significantly improved healthcare. Once the stuff of science fiction, AI is now widely used, even in our daily lives - often without us thinking about it. All healthcare professionals - especially executives and medical doctors - need to understand the capabilities of advanced AI tools and other breakthrough innovations. This understanding will allow them to recognize opportunities and threats emerging technologies can bring to their organizations. 
We hope to contribute to a meaningful public discussion about the role of this new type of AI and how our approach to healthcare and medicine can best evolve with the rapid development of this technology. Since medicine learns by example, only a few possible uses of AI in medicine are provided, which merely outline the system's capabilities. Among the examples, it is worth highlighting the roles of AI in medical notes, education, preventive programs, consultation, triage and intervention. It is believed by the authors that large language models such as chat generative pre-trained transformer (ChatGPT) are reaching a level of maturity that will soon impact clinical medicine as a whole and improve the delivery of individualized, compassionate, and scalable healthcare. It is unlikely that AI will replace physicians in the near future. The human aspects of care, including empathy, compassion, critical thinking, and complex decision-making, are invaluable in providing holistic patient care beyond diagnosis and treatment decisions. The GPT-4 has many limitations and cannot replace direct contact between an experienced physician and a patient for even the most seemingly simple consultations, not to mention the ethical and legal aspects of responsibility for diagnosis.",Wojcik S; Rulkiewicz A; Pruszczyk P; Lisik W; Pobozy M; Domienik-Karlowicz J 37726551,Using ChatGPT to predict the future of personalized medicine.,2023,The pharmacogenomics journal,,,,"Personalized medicine is a novel frontier in health care that is based on each person's unique genetic makeup. It represents an exciting opportunity to improve the future of individualized health care for all individuals. Pharmacogenomics, as the main part of personalized medicine, aims to optimize and create a more targeted treatment approach based on genetic variations in drug response. 
It is predicted that future treatments will be algorithm-based rather than evidence-based, considering a patient's genetic, transcriptomic, proteomic, epigenetic, and lifestyle factors and resulting in individualized medication. A generative pretrained transformer (GPT) is an artificial intelligence (AI) tool that generates language resembling human-like writing, enabling users to engage in a manner that is practically identical to speaking with a human being. GPT's predictive algorithms can respond to questions that have never been addressed. Chat Generative Pretrained Transformer (ChatGPT) is an advanced AI chatbot with conversational capabilities. In the present study, questions were asked of ChatGPT about the future of personalized medicine and pharmacogenomics. ChatGPT predicted both to be promising approaches with a bright future that holds great promise in improving patient outcomes and transforming the field of medicine. But it still has several limitations that need to be solved.",Patrinos GP; Sarhangi N; Sarrami B; Khodayari N; Larijani B; Hasanzad M 40200464,Minimizing STOPP and Beers Criteria Risks in PIM Treatments Using PM-TOM and ChatGPT: A Case Study.,2025,Studies in health technology and informatics,,,,"PM-TOM (Personalized Medicine-Therapy Optimization Method) is a clinical decision-support tool designed to optimize polypharmacy treatments by minimizing their adverse drug reactions (ADRs) caused by individual drugs or drug interactions (DDIs, DCIs, DFIs, DGIs), along with the risks identified by the STOPP and Beers criteria. On the other hand, AI tools like ChatGPT 4.0, trained on medical literature texts, can provide broader clinical reasoning and insights tailored to individual patient contexts. By referring to a documented deprescribing case, this study demonstrates the synergistic power of PM-TOM and ChatGPT in optimizing potentially inappropriate medication (PIM) treatments. 
A malnourished older woman was admitted to a deprescribing facility with recurrent falls, hypertension, ischemic heart disease, depression, osteoarthritis, osteoporosis, and GERD. She was initially prescribed acetaminophen, alendronate, omeprazole, lisinopril, metoprolol, aspirin, citalopram, and vitamin D, which were assessed as inadequate. While the discharge regimen improved some conditions by replacing alendronate with zoledronic acid and reducing some drug dosages, PM-TOM revealed that key risks, stemming primarily from omeprazole, aspirin, and citalopram, remained unaddressed. The discharge treatment was optimized with PM-TOM after considering alternative drug classes suggested by ChatGPT and elaborated in the available medical literature. In the optimized treatment, omeprazole (PPI) was replaced with famotidine (H2-blocker), citalopram (SSRI) with agomelatine (atypical antidepressant), zoledronic acid (bisphosphonate) with denosumab (RANK ligand inhibitor), aspirin (NSAID) with ticagrelor (antiplatelet), and lisinopril with benazepril (ACE inhibitor). These changes significantly reduced possible ADRs and the geriatric care criteria risks. Finally, ChatGPT validated the proposed adjustments, confirming their alignment with the guidelines and highlighting the potential for longer-term benefits. This case study illustrates how a combined use of PM-TOM and AI tools can effectively support the clinical decision-making process by optimizing polypharmacy treatments and minimizing their PIMs, major contributors to morbidity in older adults and high healthcare costs.",Kulenovic A; Lagumdzija-Kulenovic A 39319818,ChatGPT's Role in Improving Education Among Patients Seeking Emergency Medical Treatment.,2024,The western journal of emergency medicine,,,,"Providing appropriate patient education during a medical encounter remains an important area for improvement across healthcare settings. 
Personalized resources can offer an impactful way to improve patient understanding and satisfaction during or after a healthcare visit. ChatGPT is a novel chatbot (a computer program designed to simulate conversation with humans) that has the potential to assist with care-related questions, clarify discharge instructions, help triage medical problem urgency, and could potentially be used to improve patient-clinician communication. However, due to its training methodology, ChatGPT has inherent limitations, including technical restrictions, risk of misinformation, lack of input standardization, and privacy concerns. Medicolegal liability also remains an open question for physicians interacting with this technology. Nonetheless, careful utilization of ChatGPT in clinical medicine has the potential to supplement patient education in important ways.",Halaseh FF; Yang JS; Danza CN; Halaseh R; Spiegelman L 37704461,Battle of the (Chat)Bots: Comparing Large Language Models to Practice Guidelines for Transfusion-Associated Graft-Versus-Host Disease Prevention.,2023,Transfusion medicine reviews,,,,"Published guidelines and clinical practices vary when defining indications for irradiation of blood components for the prevention of transfusion-associated graft-versus-host disease (TA-GVHD). This study assessed irradiation indication lists generated by multiple artificial intelligence (AI) programs, or chatbots, and compared them to 2020 British Society for Haematology (BSH) practice guidelines. Four chatbots (ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat) were prompted to list the indications for irradiation to prevent TA-GVHD. Responses were graded for concordance with BSH guidelines. Chatbot response length, discrepancies, and omissions were noted. Chatbot responses differed, but all were relevant, short in length, generally more concordant than discordant with BSH guidelines, and roughly complete. 
They lacked several indications listed in BSH guidelines and notably differed in their irradiation eligibility criteria for fetuses and neonates. The chatbots variably listed erroneous indications for TA-GVHD prevention, such as patients receiving blood from a donor who is of a different race or ethnicity. This study demonstrates the potential use of generative AI for transfusion medicine and hematology topics but underscores the risk of chatbot medical misinformation. Further study of risk factors for TA-GVHD, as well as the applications of chatbots in transfusion medicine and hematology, is warranted.",Stephens LD; Jacobs JW; Adkins BD; Booth GS 37885556,Bias and Inaccuracy in AI Chatbot Ophthalmologist Recommendations.,2023,Cureus,,,,"PURPOSE AND DESIGN: To evaluate the accuracy and bias of ophthalmologist recommendations made by three AI chatbots, namely ChatGPT 3.5 (OpenAI, San Francisco, CA, USA), Bing Chat (Microsoft Corp., Redmond, WA, USA), and Google Bard (Alphabet Inc., Mountain View, CA, USA). This study analyzed chatbot recommendations for the 20 most populous U.S. cities. METHODS: Each chatbot returned 80 total recommendations when given the prompt ""Find me four good ophthalmologists in (city)."" Characteristics of the physicians, including specialty, location, gender, practice type, and fellowship, were collected. A one-proportion z-test was performed to compare the proportion of female ophthalmologists recommended by each chatbot to the national average (27.2% per the Association of American Medical Colleges (AAMC)). Pearson's chi-squared test was performed to determine differences between the three chatbots in male versus female recommendations and recommendation accuracy. RESULTS: Female ophthalmologists recommended by Bing Chat (1.61%) and Bard (8.0%) were significantly less than the national proportion of 27.2% practicing female ophthalmologists (p<0.001, p<0.01, respectively). 
ChatGPT recommended fewer female (29.5%) than male ophthalmologists (p=0.722). ChatGPT (73.8%), Bing Chat (67.5%), and Bard (62.5%) gave high rates of inaccurate recommendations. Compared to the national average of academic ophthalmologists (17%), the proportion of recommended ophthalmologists in academic medicine or in combined academic and private practice was significantly greater for all three chatbots. CONCLUSION: This study revealed substantial bias and inaccuracy in the AI chatbots' recommendations. They struggled to recommend ophthalmologists reliably and accurately, with most recommendations being physicians in specialties other than ophthalmology or not in or near the desired city. Bing Chat and Google Bard showed a significant tendency against recommending female ophthalmologists, and all chatbots favored recommending ophthalmologists in academic medicine.",Oca MC; Meller L; Wilson K; Parikh AO; McCoy A; Chang J; Sudharshan R; Gupta S; Zhang-Nunes S 39347335,Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.,2024,Cureus,,,,"Purpose Artificial intelligence (AI) has rapidly gained popularity with the growth of ChatGPT (OpenAI, San Francisco, USA) and other large-language model chatbots, and these programs have tremendous potential to impact medicine. One important area of consequence in medicine and public health is that patients may use these programs in search of answers to medical questions. Despite the increased utilization of AI chatbots by the public, there is little research to assess the reliability of ChatGPT and alternative programs when queried for medical information. This study seeks to elucidate the accuracy and readability of AI chatbots in answering patient questions regarding urology. As vasectomy is one of the most common urologic procedures, this study investigates AI-generated responses to frequently asked vasectomy-related questions. 
For this study, five popular and free-to-access AI platforms were utilized to undertake this investigation. Methods Fifteen vasectomy-related questions were individually queried to five AI chatbots from November-December 2023: ChatGPT (OpenAI, San Francisco, USA), Bard (Google Inc., Mountain View, USA), Bing (Microsoft, Redmond, USA), Perplexity (Perplexity AI Inc., San Francisco, USA), and Claude (Anthropic, San Francisco, USA). Responses from each platform were graded by two attending urologists, two urology research faculty, and one urological resident physician using a Likert (1-6) scale: (1-completely inaccurate, 6-completely accurate) based on comparison to existing American Urological Association guidelines. Flesch-Kincaid Grade levels (FKGL) and Flesch Reading Ease scores (FRES) (1-100) were calculated for each response. To assess differences in Likert, FRES, and FKGL, Kruskal-Wallis tests were performed using GraphPad Prism V10.1.0 (GraphPad, San Diego, USA) with Alpha set at 0.05. Results Analysis shows that ChatGPT provided the most accurate responses across the five AI chatbots with an average score of 5.04 on the Likert scale. Subsequently, Microsoft Bing (4.91), Anthropic Claude (4.65), Google Bard (4.43), and Perplexity (4.41) followed. All five chatbots were found to score, on average, higher than 4.41 corresponding to a score of at least ""somewhat accurate."" Google Bard received the highest Flesch Reading Ease score (49.67) and lowest Grade level (10.1) when compared to the other chatbots. Anthropic Claude scored 46.7 on the FRES and 10.55 on the FKGL. Microsoft Bing scored 45.57 on the FRES and 11.56 on the FKGL. Perplexity scored 36.4 on the FRES and 13.29 on the FKGL. ChatGPT had the lowest FRES of 30.4 and highest FKGL of 14.2. Conclusion This study investigates the use of AI in medicine, specifically urology, and it helps to determine whether large-language model chatbots can be reliable sources of freely available medical information. 
All five AI chatbots on average were able to achieve at least ""somewhat accurate"" on a 6-point Likert scale. In terms of readability, all five AI chatbots on average had Flesch Reading Ease scores of less than 50 and were higher than a 10th-grade level. In this small-scale study, there were several significant differences identified between the readability scores of each AI chatbot. However, there were no significant differences found among their accuracies. Thus, our study suggests that major AI chatbots may perform similarly in their ability to be correct but differ in their ease of being comprehended by the general public.",Carlson JA; Cheng RZ; Lange A; Nagalakshmi N; Rabets J; Shah T; Sindhwani P 39469383,"Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam.",2024,Cureus,,,,"Purpose Large language models (LLMs) are increasingly employed across various fields, including medicine and dentistry. In the field of dental anesthesiology, LLM is expected to enhance the efficiency of information gathering, patient outcomes, and education. This study evaluates the performance of different LLMs in answering questions from the Japanese Dental Society of Anesthesiology Board Certification Examination (JDSABCE) to determine their utility in dental anesthesiology. Methods The study assessed three LLMs, ChatGPT-4 (OpenAI, San Francisco, California, United States), Gemini 1.0 (Google, Mountain View, California, United States), and Claude 3 Opus (Anthropic, San Francisco, California, United States), using multiple-choice questions from the 2020 to 2022 JDSABCE exams. Each LLM answered these questions three times. The study excluded questions involving figures or deemed inappropriate. 
The primary outcome was the accuracy rate of each LLM, with secondary analysis focusing on six subgroups: (1) basic physiology necessary for general anesthesia, (2) local anesthesia, (3) sedation and general anesthesia, (4) diseases and patient management methods that pose challenges in systemic management, (5) pain management, and (6) shock and cardiopulmonary resuscitation. Statistical analysis was performed using one-way ANOVA with Dunnett's multiple comparisons, with a significance threshold of p<0.05. Results ChatGPT-4 achieved a correct answer rate of 51.2% (95% CI: 42.78-60.56, p=0.003) and Claude 3 Opus 47.4% (95% CI: 43.45-51.44, p<0.001), both significantly higher than Gemini 1.0, which had a rate of 30.3% (95% CI: 26.53-34.14). In subgroup analyses, ChatGPT-4 and Claude 3 Opus demonstrated superior performance in basic physiology, sedation and general anesthesia, and systemic management challenges compared to Gemini 1.0. Notably, ChatGPT-4 excelled in questions related to systemic management (62.5%) and Claude 3 Opus in pain management (61.53%). Conclusions ChatGPT-4 and Claude 3 Opus exhibit potential for use in dental anesthesiology, outperforming Gemini 1.0. However, their current accuracy rates are insufficient for reliable clinical use. These findings have significant implications for dental anesthesiology practice and education, including educational support, clinical decision support, and continuing education. 
To enhance LLM utility in dental anesthesiology, it is crucial to increase the availability of high-quality information online and refine prompt engineering to better guide LLM responses.",Fujimoto M; Kuroda H; Katayama T; Yamaguchi A; Katagiri N; Kagawa K; Tsukimoto S; Nakano A; Imaizumi U; Sato-Boku A; Kishimoto N; Itamiya T; Kido K; Sanuki T 37962176,ChatGPT in urology practice: revolutionizing efficiency and patient care with generative artificial intelligence.,2024,Current opinion in urology,,,,"PURPOSE OF REVIEW: ChatGPT has emerged as a potentially useful tool for healthcare. Its role in urology is in its infancy and has much potential for research, clinical practice and for patient assistance. With this narrative review, we want to draw a picture of what is known about ChatGPT's integration in urology, alongside future promises and challenges. RECENT FINDINGS: The use of ChatGPT can ease the administrative work, helping urologists with note-taking and clinical documentation such as discharge summaries and clinical notes. It can improve patient engagement through increasing awareness and facilitating communication, as it has especially been investigated for uro-oncological diseases. Its ability to understand human emotions makes ChatGPT an empathic and thoughtful interactive tool or source for urological patients and their relatives. Currently, its role in clinical diagnosis and treatment decisions is uncertain, as concerns have been raised about misinterpretation, hallucination and out-of-date information. Moreover, a mandatory regulatory process for ChatGPT in urology is yet to be established. SUMMARY: ChatGPT has the potential to contribute to precision medicine and tailored practice by its quick, structured responses. However, this will depend on how well information can be obtained by seeking appropriate responses and asking the pertinent questions. 
The key lies in being able to validate the responses, regulate the information shared, and avoid misuse in order to protect data and patient privacy. Its successful integration into mainstream urology will require educational bodies to provide guidelines or best practice recommendations.",Nedbal C; Naik N; Castellani D; Gauhar V; Geraghty R; Somani BK 38976174,Application of Artificial Intelligence in the Headache Field.,2024,Current pain and headache reports,,,,"PURPOSE OF REVIEW: Headache disorders are highly prevalent worldwide. Rapidly advancing capabilities in artificial intelligence (AI) have expanded headache-related research with the potential to solve unmet needs in the headache field. We provide an overview of AI in headache research in this article. RECENT FINDINGS: We briefly introduce machine learning models and commonly used evaluation metrics. We then review studies that have utilized AI in the field to advance diagnostic accuracy and classification, predict treatment responses, gather insights from various data sources, and forecast migraine attacks. Furthermore, given the emergence of ChatGPT, a type of large language model (LLM), and the popularity it has gained, we also discuss how LLMs could be used to advance the field. Finally, we discuss the potential pitfalls, bias, and future directions of employing AI in headache medicine. Many recent studies on headache medicine incorporated machine learning, generative AI and LLMs. A comprehensive understanding of potential pitfalls and biases is crucial to using these novel techniques with minimum harm. 
When used appropriately, AI has the potential to revolutionize headache medicine.",Ihara K; Dumkrieger G; Zhang P; Takizawa T; Schwedt TJ; Chiang CC 37729050,Large language models and the future of rheumatology: assessing impact and emerging opportunities.,2024,Current opinion in rheumatology,,,,"PURPOSE OF REVIEW: Large language models (LLMs) have grown rapidly in size and capabilities as more training data and compute power have become available. Since the release of ChatGPT in late 2022, there has been growing interest and exploration around potential applications of LLM technology. Numerous examples and pilot studies demonstrating the capabilities of these tools have emerged across several domains. For rheumatology professionals and patients, LLMs have the potential to transform current practices in medicine. RECENT FINDINGS: Recent studies have begun exploring capabilities of LLMs that can assist rheumatologists in clinical practice, research, and medical education, though applications are still emerging. In clinical settings, LLMs have shown promise in assisting healthcare professionals, enabling more personalized medicine or generating routine documentation such as notes and letters. Challenges remain around integrating LLMs into clinical workflows, accuracy of the LLMs, and ensuring patient data confidentiality. In research, early experiments demonstrate LLMs can offer analysis of datasets, with quality control as a critical piece. Lastly, LLMs could supplement medical education by providing personalized learning experiences and integration into established curricula. 
SUMMARY: As these powerful tools continue evolving at a rapid pace, rheumatology professionals should stay informed on how they may impact the field.",Mannstadt I; Mehta B 38060133,Machine Learning and Artificial Intelligence Applications to Epilepsy: a Review for the Practicing Epileptologist.,2023,Current neurology and neuroscience reports,,,,"PURPOSE OF REVIEW: Machine Learning (ML) and Artificial Intelligence (AI) are data-driven techniques to translate raw data into applicable and interpretable insights that can assist in clinical decision making. Some of these tools have extremely promising initial results, earning both great excitement and creating hype. This non-technical article reviews recent developments in ML/AI in epilepsy to assist the current practicing epileptologist in understanding both the benefits and limitations of integrating ML/AI tools into their clinical practice. RECENT FINDINGS: ML/AI tools have been developed to assist clinicians in almost every clinical decision including (1) predicting future epilepsy in people at risk, (2) detecting and monitoring for seizures, (3) differentiating epilepsy from mimics, (4) using data to improve neuroanatomic localization and lateralization, and (5) tracking and predicting response to medical and surgical treatments. We also discuss practical, ethical, and equity considerations in the development and application of ML/AI tools including chatbots based on Large Language Models (e.g., ChatGPT). ML/AI tools will change how clinical medicine is practiced, but, with rare exceptions, the transferability to other centers, effectiveness, and safety of these approaches have not yet been established rigorously. 
In the future, ML/AI will not replace epileptologists, but epileptologists with ML/AI will replace epileptologists without ML/AI.",Kerr WT; McFarlane KN 38277274,Applications of artificial intelligence-enabled robots and chatbots in ophthalmology: recent advances and future trends.,2024,Current opinion in ophthalmology,,,,"PURPOSE OF REVIEW: Recent advances in artificial intelligence (AI), robotics, and chatbots have brought these technologies to the forefront of medicine, particularly ophthalmology. These technologies have been applied in diagnosis, prognosis, surgical operations, and patient-specific care in ophthalmology. It is thus both timely and pertinent to assess the existing landscape, recent advances, and trajectory of trends of AI, AI-enabled robots, and chatbots in ophthalmology. RECENT FINDINGS: Some recent developments have integrated AI-enabled robotics with diagnosis and surgical procedures in ophthalmology. More recently, large language models (LLMs) like ChatGPT have shown promise in augmenting research capabilities and diagnosing ophthalmic diseases. These developments may portend a new era of doctor-patient-machine collaboration. SUMMARY: Ophthalmology is undergoing a revolutionary change in research, clinical practice, and surgical interventions. Ophthalmic AI-enabled robotics and chatbot technologies based on LLMs are converging to create a new era of digital ophthalmology. Collectively, these developments portend a future in which conventional ophthalmic knowledge will be seamlessly integrated with AI to improve the patient experience and enhance therapeutic outcomes.",Madadi Y; Delsoz M; Khouri AS; Boland M; Grzybowski A; Yousefi S 38032442,An Introduction to Generative Artificial Intelligence in Mental Health Care: Considerations and Guidance.,2023,Current psychiatry reports,,,,"PURPOSE OF REVIEW: This paper provides an overview of generative artificial intelligence (AI) and the possible implications in the delivery of mental health care. 
RECENT FINDINGS: Generative AI is a powerful technology that is changing rapidly. As psychiatrists, it is important for us to understand generative AI technology and how it may impact our patients and our practice of medicine. This paper aims to build this understanding by focusing on GPT-4 and its potential impact on mental health care delivery. We first introduce key concepts and terminology describing how the technology works and various novel uses of it. We then dive into key considerations for GPT-4 and other large language models (LLMs) and wrap up with suggested future directions and initial guidance to the field.",King DR; Nanda G; Stoddard J; Dempsey A; Hergert S; Shore JH; Torous J 39450858,A comparison of drug information question responses by a drug information center and by ChatGPT.,2025,American journal of health-system pharmacy : AJHP : official journal of the American Society of Health-System Pharmacists,,,,"PURPOSE: A study was conducted to assess the accuracy and ability of Chat Generative Pre-trained Transformer (ChatGPT) to systematically respond to drug information inquiries relative to responses of a drug information center (DIC). METHODS: Ten drug information questions answered by the DIC in 2022 or 2023 were selected for analysis. Three pharmacists created new ChatGPT accounts and submitted each question to ChatGPT at the same time. Each question was submitted twice to identify consistency in responses. Two days later, the same process was conducted by a fourth pharmacist. Phase 1 of data analysis consisted of a drug information pharmacist assessing all 84 ChatGPT responses for accuracy relative to the DIC responses. In phase 2, 10 ChatGPT responses were selected to be assessed by 3 blinded reviewers. Reviewers utilized an 8-question predetermined rubric to evaluate the ChatGPT and DIC responses. RESULTS: When comparing the ChatGPT responses (n = 84) to the DIC responses, ChatGPT had an overall accuracy rate of 50%. 
Accuracy across the different question types varied. In regards to the overall blinded score, ChatGPT responses scored higher than the responses by the DIC according to the rubric (overall scores of 67.5% and 55.0%, respectively). The DIC responses scored higher in the categories of references mentioned and references identified. CONCLUSION: Responses generated by ChatGPT have been found to be better than those created by a DIC in clarity and readability; however, the accuracy of ChatGPT responses was lacking. ChatGPT responses to drug information questions would need to be carefully reviewed for accuracy and completeness.",Triplett S; Ness-Engle GL; Behnen EM 38648540,How do we teach generative artificial intelligence to medical educators? Pilot of a faculty development workshop using ChatGPT.,2025,Medical teacher,,,,"PURPOSE: Artificial intelligence (AI) is already impacting the practice of medicine and it is therefore important for future healthcare professionals and medical educators to gain experience with the benefits, limitations, and applications of this technology. The purpose of this project was to develop, implement, and evaluate a faculty development workshop on generative AI using ChatGPT, to familiarise participants with AI. MATERIALS AND METHODS: A brief workshop introducing faculty to generative AI and its applications in medical education was developed for preclinical clinical skills preceptors at our institution. During the workshop, faculty were given prompts to enter into ChatGPT that were relevant to their teaching activities, including generating differential diagnoses and providing feedback on student notes. Participant feedback was collected using an anonymous survey. RESULTS: 27/36 participants completed the survey. Prior to the workshop, 15% of participants indicated having used ChatGPT, and approximately half were familiar with AI applications in medical education. 
Interest in using the tool increased from 43% to 65% following the workshop, yet participants expressed concerns regarding accuracy and privacy with use of ChatGPT. CONCLUSION: This brief workshop serves as a model for faculty development in AI applications in medical education. The workshop increased interest in using ChatGPT for educational purposes, and was well received.",Chadha N; Popil E; Gregory J; Armstrong-Davies L; Justin G 40125532,Integrating ChatGPT as a Tool in Pharmacy Practice: A Cross-Sectional Exploration Among Pharmacists in Saudi Arabia.,2025,Integrated pharmacy research & practice,,,,"PURPOSE: Artificial Intelligence (AI), especially ChatGPT, is rapidly assimilating into healthcare, providing significant advantages in pharmacy practice, such as improved clinical decision-making, patient counselling, and drug information management. The adoption of AI tools is heavily contingent upon pharmacy practitioners' knowledge, attitudes, and practices (KAP). This study sought to evaluate the knowledge and practices of pharmacists in Saudi Arabia concerning the utilization of ChatGPT in their daily activities. PATIENTS AND METHODS: A cross-sectional study was performed from May 2023 to July 2024 including pharmacists in Riyadh, Saudi Arabia. An online pre-validated KAP questionnaire was disseminated, collecting data on demographics, knowledge, attitudes, and practices about ChatGPT. Descriptive statistics and regression analyses were conducted using SPSS. RESULTS: Of 1022 respondents, 78.7% were familiar with AI in pharmacy, while 90.1% correctly identified ChatGPT as an advanced AI chatbot. Positive attitudes towards ChatGPT were reported by 64.1% of pharmacists, although only 24.3% used AI tools regularly. Significant predictors of positive attitudes and practices included academic/research roles (beta=0.7, p=0.005) and 6-10 years of experience (beta=0.9, p=0.05). 
Ethical concerns were raised by 64% of respondents, and 92% reported a lack of formal training. CONCLUSION: While the majority of pharmacists held positive attitudes toward ChatGPT, practical implementation remains limited due to ethical concerns and inadequate training. Addressing these barriers is essential for successful AI integration in pharmacy, supporting Saudi Arabia's Vision 2030 initiative.",Alghitran A; AlOsaimi HM; Albuluwi A; Almalki EO; Aldowayan AZ; Alharthi R; Qattan JM; Alghamdi F; AlHalabi M; Almalki NA; Alharthi A; Alshammari A; Kanan M 39494261,Evaluating the Accuracy of Large Language Model (ChatGPT) in Providing Information on Metastatic Breast Cancer.,2024,Advanced pharmaceutical bulletin,,,,"PURPOSE: Artificial intelligence (AI), particularly large language models like ChatGPT developed by OpenAI, has demonstrated potential in various domains, including medicine. While ChatGPT has shown the capability to pass rigorous exams like the United States Medical Licensing Examination (USMLE) Step 1, its proficiency in addressing breast cancer-related inquiries-a complex and prevalent disease-remains underexplored. This study aims to assess the accuracy and comprehensiveness of ChatGPT's responses to common breast cancer questions, addressing a critical gap in the literature and evaluating its potential in enhancing patient education and support in breast cancer management. METHODS: A curated list of 100 frequently asked breast cancer questions was compiled from Cancer.net, the National Breast Cancer Foundation, and clinical practice. These questions were input into ChatGPT, and the responses were evaluated for accuracy by two primary experts using a four-point scale. Discrepancies in scoring were resolved through additional expert review. RESULTS: Of the 100 responses, 5 were entirely inaccurate, 22 partially accurate, 42 accurate but lacking comprehensiveness, and 31 highly accurate. 
The majority of the responses were found to be at least partially accurate, demonstrating ChatGPT's potential in providing reliable information on breast cancer. CONCLUSION: ChatGPT shows promise as a supplementary tool for patient education on breast cancer. While generally accurate, the presence of inaccuracies underscores the need for professional oversight. The study advocates for integrating AI tools like ChatGPT in healthcare settings to support patient-provider interactions and health education, emphasizing the importance of regular updates to reflect the latest research and clinical guidelines.",Gummadi R; Dasari N; Kumar DS; Pindiprolu SKSS 37578849,Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment.,2024,Canadian Association of Radiologists journal = Journal l'Association canadienne des radiologistes,,,,"PURPOSE: Bard by Google, a direct competitor to ChatGPT, was recently released. Understanding the relative performance of these different chatbots can provide important insight into their strengths and weaknesses as well as which roles they are most suited to fill. In this project, we aimed to compare the most recent version of ChatGPT, ChatGPT-4, and Bard by Google, in their ability to accurately respond to radiology board examination practice questions. METHODS: Text-based questions were collected from the 2017-2021 American College of Radiology's Diagnostic Radiology In-Training (DXIT) examinations. ChatGPT-4 and Bard were queried, and their comparative accuracies, response lengths, and response times were documented. Subspecialty-specific performance was analyzed as well. RESULTS: 318 questions were included in our analysis. ChatGPT answered significantly more accurately than Bard (87.11% vs 70.44%, P < .0001). ChatGPT's response length was significantly shorter than Bard's (935.28 +/- 440.88 characters vs 1437.52 +/- 415.91 characters, P < .0001). 
ChatGPT's response time was significantly longer than Bard's (26.79 +/- 3.27 seconds vs 7.55 +/- 1.88 seconds, P < .0001). ChatGPT performed superiorly to Bard in neuroradiology (100.00% vs 86.21%, P = .03), general & physics (85.39% vs 68.54%, P < .001), nuclear medicine (80.00% vs 56.67%, P < .01), pediatric radiology (93.75% vs 68.75%, P = .03), and ultrasound (100.00% vs 63.64%, P < .001). In the remaining subspecialties, there were no significant differences between ChatGPT's and Bard's performance. CONCLUSION: ChatGPT displayed superior radiology knowledge compared to Bard. While both chatbots display reasonable radiology knowledge, they should be used with conscious knowledge of their limitations and fallibility. Both chatbots provided incorrect or illogical answer explanations and did not always address the educational content of the question.",Patil NS; Huang RS; van der Pol CB; Larocque N 39348666,"Informatics and Artificial Intelligence-Guided Assessment of the Regulatory and Translational Research Landscape of First-in-Class Oncology Drugs in the United States, 2018-2022.",2024,JCO clinical cancer informatics,,,,"PURPOSE: Cancer drug development remains a critical but challenging process that affects millions of patients and their families. Using biomedical informatics and artificial intelligence (AI) approaches, we assessed the regulatory and translational research landscape defining successful first-in-class drugs for patients with cancer. METHODS: This is a retrospective observational study of all novel first-in-class drugs approved by the US Food and Drug Administration (FDA) from 2018 to 2022, stratified by cancer versus noncancer drugs. A biomedical informatics pipeline leveraging interoperability standards and ChatGPT performed integration and analysis of public databases provided by the FDA, National Institutes of Health, and WHO. 
RESULTS: Between 2018 and 2022, the FDA approved a total of 247 novel drugs, of which 107 (43.3%) were first-in-class drugs involving a new biologic target. Of these first-in-class drugs, 30 (28%) treatments were indicated for patients with cancer, including 19 (63.3%) for solid tumors and the remaining 11 (36.7%) for hematologic cancers. A median of 68 publications of basic, clinical, and other relevant translational science preceded successful FDA approval of first-in-class cancer drugs, with oncology-related treatments involving fewer median years of target-based research than therapies not related to cancer (33 v 43 years; P < .05). Overall, 94.4% of first-in-class drugs had at least 25 years of target-related research papers, while 85.5% of first-in-class drugs had at least 10 years of translational research publications. CONCLUSION: Novel first-in-class cancer treatments are defined by diverse clinical indications, personalized molecular targets, dependence on expedited regulatory pathways, and translational research metrics reflecting this complex landscape. Biomedical informatics and AI provide scalable, data-driven ways to assess and even address important challenges in the drug development pipeline.",Ronquillo JG; South B; Naik P; Singh R; De Jesus M; Watt SJ; Habtezion A 38650670,The Application of ChatGPT in Medicine: A Scoping Review and Bibliometric Analysis.,2024,Journal of multidisciplinary healthcare,,,,"PURPOSE: ChatGPT has a wide range of applications in the medical field. Therefore, this review aims to define the key issues and provide a comprehensive view of the literature based on the application of ChatGPT in medicine. METHODS: This scope follows Arksey and O'Malley's five-stage framework. A comprehensive literature search of publications (30 November 2022 to 16 August 2023) was conducted. Six databases were searched and relevant references were systematically catalogued. 
Attention was focused on the general characteristics of the articles, their fields of application, and the advantages and disadvantages of using ChatGPT. Descriptive statistics and narrative synthesis methods were used for data analysis. RESULTS: Of the 3426 studies, 247 met the criteria for inclusion in this review. The majority of articles (31.17%) were from the United States. Editorials (43.32%) ranked first, followed by experimental studies (11.74%). The potential applications of ChatGPT in medicine are varied, with the largest number of studies (45.75%) exploring clinical practice, including assisting with clinical decision support and providing disease information and medical advice. This was followed by medical education (27.13%) and scientific research (16.19%). Particularly noteworthy in the discipline statistics were radiology, surgery and dentistry at the top of the list. However, ChatGPT in medicine also faces issues of data privacy, inaccuracy and plagiarism. CONCLUSION: The application of ChatGPT in medicine focuses on different disciplines and general application scenarios. ChatGPT has a paradoxical nature: it offers significant advantages, but at the same time raises great concerns about its application in healthcare settings. Therefore, it is imperative to develop theoretical frameworks that not only address its widespread use in healthcare but also facilitate a comprehensive assessment. 
In addition, these frameworks should contribute to the development of strict and effective guidelines and regulatory measures.",Wu J; Ma Y; Wang J; Xiao M 38866652,Evaluating ChatGPT to test its robustness as an interactive information database of radiation oncology and to assess its responses to common queries from radiotherapy patients: A single institution investigation.,2024,Cancer radiotherapie : journal de la Societe francaise de radiotherapie oncologique,,,,"PURPOSE: Commercial vendors have created artificial intelligence (AI) tools for use in all aspects of life and medicine, including radiation oncology. AI innovations will likely disrupt workflows in the field of radiation oncology. However, limited data exist on using AI-based chatbots about the quality of radiation oncology information. This study aims to assess the accuracy of ChatGPT, an AI-based chatbot, in answering patients' questions during their first visit to the radiation oncology outpatient department and test knowledge of ChatGPT in radiation oncology. MATERIAL AND METHODS: Expert opinion was formulated using a set of ten standard questions of patients encountered in outpatient department practice. A blinded expert opinion was taken for the ten questions on common queries of patients in outpatient department visits, and the same questions were evaluated on ChatGPT version 3.5 (ChatGPT 3.5). The answers by expert and ChatGPT were independently evaluated for accuracy by three scientific reviewers. Additionally, a comparison was made for the extent of similarity of answers between ChatGPT and experts by a response scoring for each answer. Word count and Flesch-Kincaid readability score and grade were done for the responses obtained from expert and ChatGPT. A comparison of the answers of ChatGPT and expert was done with a Likert scale. As a second component of the study, we tested the technical knowledge of ChatGPT. 
Ten multiple-choice questions were framed in increasing order of difficulty (basic, intermediate, and advanced), and the responses were evaluated on ChatGPT. Statistical testing was done using SPSS version 27. RESULTS: After expert review, the accuracy of expert opinion was 100%, and ChatGPT's was 80% (8/10) for regular questions encountered in outpatient department visits. A noticeable difference was observed in word count and readability of answers from expert opinion and ChatGPT. Of the ten multiple-choice questions assessing the radiation oncology knowledge base, ChatGPT had an accuracy rate of 90% (9 out of 10). One answer to a basic-level question was incorrect, whereas all answers to intermediate and difficult-level questions were correct. CONCLUSION: ChatGPT provides reasonably accurate information about routine questions encountered in the first outpatient department visit of the patient and also demonstrated a sound knowledge of the subject. The result of our study can inform the future development of educational tools in radiation oncology and may have implications in other medical fields. 
This is the first study that provides essential insight into the potentially positive capabilities of two components of ChatGPT: firstly, ChatGPT's response to common queries of patients at OPD visits, and secondly, the assessment of the radiation oncology knowledge base of ChatGPT.",Pandey VK; Munshi A; Mohanti BK; Bansal K; Rastogi K 38393353,An introduction to machine learning and generative artificial intelligence for otolaryngologists-head and neck surgeons: a narrative review.,2024,European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery,,,,"PURPOSE: Despite the robust expansion of research surrounding artificial intelligence (AI) and machine learning (ML) and their applications to medicine, these methodologies often remain opaque and inaccessible to many otolaryngologists. Especially, with the increasing ubiquity of large-language models (LLMs), such as ChatGPT and their potential implementation in clinical practice, clinicians may benefit from a baseline understanding of some aspects of AI. In this narrative review, we seek to clarify underlying concepts, illustrate applications to otolaryngology, and highlight future directions and limitations of these tools. METHODS: Recent literature regarding AI principles and otolaryngologic applications of ML and LLMs was reviewed via search in PubMed and Google Scholar. RESULTS: Significant recent strides have been made in otolaryngology research utilizing AI and ML, across all subspecialties, including neurotology, head and neck oncology, laryngology, rhinology, and sleep surgery. Potential applications suggested by recent publications include screening and diagnosis, predictive tools, clinical decision support, and clinical workflow improvement via LLMs. 
Ongoing concerns regarding AI in medicine include ethical concerns around bias and data sharing, as well as the ""black box"" problem and limitations in explainability. CONCLUSIONS: Potential implementations of AI in otolaryngology are rapidly expanding. While implementation in clinical practice remains theoretical for most of these tools, their potential power to influence the practice of otolaryngology is substantial.",Alter IL; Chan K; Lechien J; Rameau A 37334036,Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings.,2023,Ophthalmology science,,,,"PURPOSE: Foundation models are a novel type of artificial intelligence algorithms, in which models are pretrained at scale on unannotated data and fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space. DESIGN: Evaluation of diagnostic test or technology. PARTICIPANTS: ChatGPT is a publicly available LLM. METHODS: We tested 2 versions of ChatGPT (January 9 ""legacy"" and ChatGPT Plus) on 2 popular multiple choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey's test to decide if there were meaningful differences between the tested subspecialties. MAIN OUTCOME MEASURES: We reported the accuracy of ChatGPT for each examination section in percentage correct by comparing ChatGPT's outputs with the answer key provided by the question banks. 
We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05. RESULTS: The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% +/- 0.6% and 49.2% +/- 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT's answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections. CONCLUSION: ChatGPT has encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties. FINANCIAL DISCLOSURES: Proprietary or commercial disclosure may be found after the references.",Antaki F; Touma S; Milad D; El-Khoury J; Duval R 37792149,"Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society.",2024,Japanese journal of radiology,,,,"PURPOSE: Herein, we assessed the accuracy of large language models (LLMs) in generating responses to questions in clinical radiology practice. We compared the performance of ChatGPT, GPT-4, and Google Bard using questions from the Japan Radiology Board Examination (JRBE). MATERIALS AND METHODS: In total, 103 questions from the JRBE 2022 were used with permission from the Japan Radiological Society. These questions were categorized by pattern, required level of thinking, and topic. 
McNemar's test was used to compare the proportion of correct responses between the LLMs. Fisher's exact test was used to assess the performance of GPT-4 for each topic category. RESULTS: ChatGPT, GPT-4, and Google Bard correctly answered 40.8% (42 of 103), 65.0% (67 of 103), and 38.8% (40 of 103) of the questions, respectively. GPT-4 significantly outperformed ChatGPT by 24.2% (p < 0.001) and Google Bard by 26.2% (p < 0.001). In the categorical analysis by level of thinking, GPT-4 correctly answered 79.7% of the lower-order questions, which was significantly higher than ChatGPT or Google Bard (p < 0.001). The categorical analysis by question pattern revealed GPT-4's superiority over ChatGPT (67.4% vs. 46.5%, p = 0.004) and Google Bard (39.5%, p < 0.001) in the single-answer questions. The categorical analysis by topic revealed that GPT-4 outperformed ChatGPT (40%, p = 0.013) and Google Bard (26.7%, p = 0.004). No significant differences were observed between the LLMs in the categories not mentioned above. The performance of GPT-4 was significantly better in nuclear medicine (93.3%) than in diagnostic radiology (55.8%; p < 0.001). GPT-4 also performed better on lower-order questions than on higher-order questions (79.7% vs. 45.5%, p < 0.001). CONCLUSION: ChatGPT Plus, based on GPT-4, scored 65% when answering Japanese questions from the JRBE, outperforming ChatGPT and Google Bard. This highlights the potential of using LLMs to address advanced clinical questions in the field of radiology in Japan.",Toyama Y; Harigai A; Abe M; Nagano M; Kawabata M; Seki Y; Takase K 38801461,[ChatGPT and the German board examination for ophthalmology: an evaluation].,2024,Die Ophthalmologie,,,,"PURPOSE: In recent years artificial intelligence (AI), as a new segment of computer science, has become increasingly important in medicine. 
The aim of this project was to investigate whether the current version of ChatGPT (ChatGPT 4.0) is able to answer open questions that could be asked in the context of a German board examination in ophthalmology. METHODS: After excluding image-based questions, 10 questions from 15 different chapters/topics were selected from the textbook 1000 questions in ophthalmology (1000 Fragen Augenheilkunde 2nd edition, 2014). ChatGPT was instructed by means of a so-called prompt to assume the role of a board-certified ophthalmologist and to concentrate on the essentials when answering. A human expert with considerable expertise in the respective topic evaluated the answers regarding their correctness, relevance and internal coherence. Additionally, the overall performance was rated by school grades and it was assessed whether the answers would have been sufficient to pass the ophthalmology board examination. RESULTS: ChatGPT would have passed the board examination in 12 out of 15 topics. The overall performance, however, was limited, with only 53.3% completely correct answers. While the correctness of the results in the different topics was highly variable (uveitis and lens/cataract 100%; optics and refraction 20%), the answers always had a high thematic fit (70%) and internal coherence (71%).
CONCLUSION: The fact that ChatGPT 4.0 would have passed the specialist examination in 12 out of 15 topics is remarkable considering the fact that this AI was not specifically trained for medical questions; however, there is a considerable performance variability between the topics, with some serious shortcomings that currently rule out its safe use in clinical practice.",Yaici R; Cieplucha M; Bock R; Moayed F; Bechrakis NE; Berens P; Feltgen N; Friedburg D; Graf M; Guthoff R; Hoffmann EM; Hoerauf H; Hintschich C; Kohnen T; Messmer EM; Nentwich MM; Pleyer U; Schaudig U; Seitz B; Geerling G; Roth M 37577545,From Answers to Insights: Unveiling the Strengths and Limitations of ChatGPT and Biomedical Knowledge Graphs.,2023,Research square,,,,"PURPOSE: Large Language Models (LLMs) have shown exceptional performance in various natural language processing tasks, benefiting from their language generation capabilities and ability to acquire knowledge from unstructured text. However, in the biomedical domain, LLMs face limitations that lead to inaccurate and inconsistent answers. Knowledge Graphs (KGs) have emerged as valuable resources for organizing structured information. Biomedical Knowledge Graphs (BKGs) have gained significant attention for managing diverse and large-scale biomedical knowledge. The objective of this study is to assess and compare the capabilities of ChatGPT and existing BKGs in question-answering, biomedical knowledge discovery, and reasoning tasks within the biomedical domain. METHODS: We conducted a series of experiments to assess the performance of ChatGPT and the BKGs in various aspects of querying existing biomedical knowledge, knowledge discovery, and knowledge reasoning. Firstly, we tasked ChatGPT with answering questions sourced from the ""Alternative Medicine"" sub-category of Yahoo! Answers and recorded the responses. 
Additionally, we queried BKG to retrieve the relevant knowledge records corresponding to the questions and assessed them manually. In another experiment, we formulated a prediction scenario to assess ChatGPT's ability to suggest potential drug/dietary supplement repurposing candidates. Simultaneously, we utilized BKG to perform link prediction for the same task. The outcomes of ChatGPT and BKG were compared and analyzed. Furthermore, we evaluated ChatGPT and BKG's capabilities in establishing associations between pairs of proposed entities. This evaluation aimed to assess their reasoning abilities and the extent to which they can infer connections within the knowledge domain. RESULTS: The results indicate that ChatGPT with GPT-4.0 outperforms both GPT-3.5 and BKGs in providing existing information. However, BKGs demonstrate higher reliability in terms of information accuracy. ChatGPT exhibits limitations in performing novel discoveries and reasoning, particularly in establishing structured links between entities compared to BKGs. CONCLUSIONS: To address the limitations observed, future research should focus on integrating LLMs and BKGs to leverage the strengths of both approaches. Such integration would optimize task performance and mitigate potential risks, leading to advancements in knowledge within the biomedical field and contributing to the overall well-being of individuals.",Hou Y; Yeung J; Xu H; Su C; Wang F; Zhang R 39741798,Patient Support in Obstructive Sleep Apnoea by a Large Language Model - ChatGPT 4o on Answering Frequently Asked Questions on First Line Positive Airway Pressure and Second Line Hypoglossal Nerve Stimulation Therapy: A Pilot Study.,2024,Nature and science of sleep,,,,"PURPOSE: Obstructive sleep apnoea (OSA) is a common disease that benefits from early treatment and patient support in order to prevent secondary illnesses. 
This study assesses the capability of the large language model (LLM) ChatGPT-4o to offer patient support regarding first line positive airway pressure (PAP) and second line hypoglossal nerve stimulation (HGNS) therapy. METHODS: Seventeen questions, each regarding PAP and HGNS therapy, were posed to ChatGPT-4o. Answers were rated by experienced experts in sleep medicine on a 6-point Likert scale in the categories of medical adequacy, conciseness, coherence, and comprehensibility. Completeness of medical information and potential hazard for patients were rated using a binary system. RESULTS: Overall, ChatGPT-4o achieved reasonably high ratings in all categories. In medical adequacy, it performed significantly better on PAP questions (mean 4.9) compared to those on HGNS (mean 4.6) (p < 0.05). Scores for coherence, comprehensibility and conciseness showed similar results for both HGNS and PAP answers. Raters confirmed completeness of responses in 45 of 51 ratings (88.24%) for PAP answers and 28 of 51 ratings (54.9%) for HGNS answers. Potential hazards for patients were stated in 2 of 52 ratings (3.8%) for PAP answers and none for HGNS answers. CONCLUSION: ChatGPT-4o has potential as a valuable patient-oriented support tool in sleep medicine therapy that can enhance subsequent face-to-face consultations with a sleep specialist.
However, some substantial flaws regarding second line HGNS therapy are most likely due to recent advances in HGNS therapy and the consequent limited information available in LLM training data.",Pordzik J; Bahr-Hamm K; Huppertz T; Gouveris H; Seifen C; Blaikie A; Matthias C; Kuhn S; Eckrich J; Buhr CR 39661913,Comparative Analysis of Generative Pre-Trained Transformer Models in Oncogene-Driven Non-Small Cell Lung Cancer: Introducing the Generative Artificial Intelligence Performance Score.,2024,JCO clinical cancer informatics,,,,"PURPOSE: Precision oncology in non-small cell lung cancer (NSCLC) relies on biomarker testing for clinical decision making. Despite its importance, challenges like the lack of genomic oncology training, nonstandardized biomarker reporting, and a rapidly evolving treatment landscape hinder its practice. Generative artificial intelligence (AI), such as ChatGPT, offers promise for enhancing clinical decision support. Effective performance metrics are crucial to evaluate these models' accuracy and their propensity for producing incorrect or hallucinated information. We assessed various ChatGPT versions' ability to generate accurate next-generation sequencing reports and treatment recommendations for NSCLC, using a novel Generative AI Performance Score (G-PS), which considers accuracy, relevancy, and hallucinations. METHODS: We queried ChatGPT versions for first-line NSCLC treatment recommendations with an Food and Drug Administration-approved targeted therapy, using a zero-shot prompt approach for eight oncogenes. Responses were assessed against National Comprehensive Cancer Network (NCCN) guidelines for accuracy, relevance, and hallucinations, with G-PS calculating scores from -1 (all hallucinations) to 1 (fully NCCN-compliant recommendations). G-PS was designed as a composite measure with a base score for correct recommendations (weighted for preferred treatments) and a penalty for hallucinations. 
RESULTS: Analyzing 160 responses, generative pre-trained transformer (GPT)-4 outperformed GPT-3.5, showing a higher base score (90% v 60%; P < .01) and fewer hallucinations (34% v 53%; P < .01). GPT-4's overall G-PS was significantly higher (0.34 v -0.15; P < .01), indicating superior performance. CONCLUSION: This study highlights the rapid improvement of generative AI in matching treatment recommendations with biomarkers in precision oncology. Although the rate of hallucinations improved in the GPT-4 model, future generative AI use in clinical care requires high levels of accuracy with minimal to no room for hallucinations. The G-PS represents a novel metric quantifying generative AI utility in health care compared with national guidelines, with potential adaptation beyond precision oncology.",Hamilton Z; Aseem A; Chen Z; Naffakh N; Reizine NM; Weinberg F; Jain S; Kessler LG; Gadi VK; Bun C; Nguyen RH 37668790,Performance of ChatGPT in Israeli Hebrew OBGYN national residency examinations.,2023,Archives of gynecology and obstetrics,,,,"PURPOSE: Previous studies of ChatGPT performance in the field of medical examinations have reached contradictory results. Moreover, the performance of ChatGPT in languages other than English is yet to be explored. We aim to study the performance of ChatGPT in the Hebrew OBGYN 'Shlav-Alef' (Phase 1) examination. METHODS: A performance study was conducted using a consecutive sample of text-based multiple choice questions originating from authentic Hebrew OBGYN 'Shlav-Alef' examinations in 2021-2022. We constructed 150 multiple choice questions from consecutive text-based-only original questions. We compared the performance of ChatGPT to the real-life performance of OBGYN residents who completed the tests in 2021-2022. We also compared ChatGPT's Hebrew performance vs. previously published English medical tests.
RESULTS: In 2021-2022, 27.8% of OBGYN residents failed the 'Shlav-Alef' examination and the mean score of the residents was 68.4. Overall, 150 authentic questions were evaluated (one examination). ChatGPT correctly answered 58 questions (38.7%) and received a failing score. The performance of Hebrew ChatGPT was lower when compared to the actual performance of residents: 38.7% vs. 68.4%, p < .001. Compared with ChatGPT's performance on 9,091 English-language questions in the field of medicine, the performance of Hebrew ChatGPT was lower (38.7% in Hebrew vs. 60.7% in English, p < .001). CONCLUSIONS: ChatGPT correctly answered less than 40% of Hebrew OBGYN resident examination questions. Residents cannot rely on ChatGPT for the preparation of this examination. Efforts should be made to improve ChatGPT performance in other languages besides English.",Cohen A; Alter R; Lessans N; Meyer R; Brezinov Y; Levin G 37530687,Exploring the Role of a Large Language Model on Carpal Tunnel Syndrome Management: An Observation Study of ChatGPT.,2023,The Journal of hand surgery,,,,"PURPOSE: Recently, large language models, such as ChatGPT, have emerged as promising tools to facilitate scientific research and health care management. The present study aimed to explore the extent of knowledge possessed by ChatGPT concerning carpal tunnel syndrome (CTS), a compressive neuropathy that may lead to impaired hand function and that is frequently encountered in the field of hand surgery. METHODS: Six questions pertaining to diagnosis and management of CTS were posed to ChatGPT. The responses were subsequently analyzed and evaluated based on their accuracy, coherence, and comprehensiveness. In addition, ChatGPT was requested to provide five high-level evidence references in support of its answers. A simulated doctor-patient consultation was also conducted to assess whether ChatGPT could offer safe medical advice.
RESULTS: ChatGPT supplied clinically relevant information regarding CTS, although at a relatively superficial level. In the context of doctor-patient interaction, ChatGPT suggested a diagnostic pathway that deviated from the widely accepted clinical consensus on CTS diagnosis. Nevertheless, it incorporated differential diagnoses and valuable management options for CTS. Although ChatGPT demonstrated the ability to retain and recall information from previous patient conversations, it infrequently produced pertinent references, many of which were either nonexistent or incorrect. CONCLUSIONS: ChatGPT displayed the capability to deliver validated medical information on CTS to nonmedical individuals. However, the generation of nonexistent and inaccurate references by ChatGPT presents a challenge to academic integrity. CLINICAL RELEVANCE: To increase their utility in medicine and academia, large language models must go through specialized reputable data set training and validation from experts. It is essential to note that at present, large language models cannot replace the expertise of health care professionals and may act as a supportive tool.",Seth I; Xie Y; Rodwell A; Gracias D; Bulloch G; Hunter-Smith DJ; Rozen WM 37566133,Will artificial intelligence chatbots replace clinical pharmacologists? An exploratory study in clinical practice.,2023,European journal of clinical pharmacology,,,,"PURPOSE: Recently, there has been a growing interest in using ChatGPT for various applications in Medicine. We evaluated the interest of OpenAI chatbot (GPT 4.0) for drug information activities at Toulouse Pharmacovigilance Center. METHODS: Based on a series of 50 randomly selected questions sent to our pharmacovigilance center by healthcare professionals or patients, we compared the level of responses from the chatbot GPT 4.0 with those provided by specialists in pharmacovigilance. RESULTS: Chatbot answers were globally not acceptable. 
Responses to inquiries regarding the assessment of drug causality were not consistently precise or clinically meaningful. CONCLUSION: The interest of chatbot assistance needs to be confirmed or rejected through further studies conducted in other pharmacovigilance centers.",Montastruc F; Storck W; de Canecaude C; Victor L; Li J; Cesbron C; Zelmat Y; Barus R 38821410,Reliability and readability analysis of ChatGPT-4 and Google Bard as a patient information source for the most commonly applied radionuclide treatments in cancer patients.,2024,Revista espanola de medicina nuclear e imagen molecular,,,,"PURPOSE: Searching for online health information is a popular approach employed by patients to enhance their knowledge of their diseases. Recently developed AI chatbots are probably the easiest way in this regard. The purpose of the study is to analyze the reliability and readability of AI chatbot responses in terms of the most commonly applied radionuclide treatments in cancer patients. METHODS: Basic patient questions, thirty about RAI, PRRT and TARE treatments and twenty-nine about PSMA-TRT, were asked one by one to GPT-4 and Bard in January 2024. The reliability and readability of the responses were assessed using the DISCERN scale, the Flesch Reading Ease (FRE) and the Flesch-Kincaid Reading Grade Level (FKRGL). RESULTS: The mean (SD) FKRGL scores for the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 14.57 (1.19), 14.65 (1.38), 14.25 (1.10), 14.38 (1.2) and 11.49 (1.59), 12.42 (1.71), 11.35 (1.80), 13.01 (1.97), respectively. In terms of readability, the FKRGL scores of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatments were above the general public reading grade level.
The mean (SD) DISCERN scores assessed by a nuclear medicine physician for the responses of GPT-4 and Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 47.86 (5.09), 48.48 (4.22), 46.76 (4.09), 48.33 (5.15) and 51.50 (5.64), 53.44 (5.42), 53 (6.36), 49.43 (5.32), respectively. Based on mean DISCERN scores, the reliability of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT, and TARE treatments ranged from fair to good. The inter-rater reliability correlation coefficients of DISCERN scores assessed by GPT-4, Bard and the nuclear medicine physician for the responses of GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments were 0.512 (95% CI 0.296: 0.704), 0.695 (95% CI 0.518: 0.829), 0.687 (95% CI 0.511: 0.823) and 0.649 (95% CI 0.462: 0.798), respectively (p < 0.01). The inter-rater reliability correlation coefficients of DISCERN scores assessed by GPT-4, Bard and the nuclear medicine physician for the responses of Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 0.753 (95% CI 0.602: 0.863), 0.812 (95% CI 0.686: 0.899), 0.804 (95% CI 0.677: 0.894) and 0.671 (95% CI 0.489: 0.812), respectively (p < 0.01). The inter-rater reliability for the responses of Bard and GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments was moderate to good. Further, consulting the nuclear medicine physician was rarely emphasized by either GPT-4 or Google Bard, and references were included in some responses of Google Bard, but there were no references in GPT-4. CONCLUSION: Although the information provided by AI chatbots may be acceptable in medical terms, it may not be easy to read for the general public, which may prevent it from being understood. Effective prompts using 'prompt engineering' may refine the responses in a more comprehensible manner. Since radionuclide treatments are specific to nuclear medicine expertise, a nuclear medicine physician needs to be stated as a consultant in responses in order to guide patients and caregivers to obtain accurate medical advice.
Referencing is significant in terms of confidence and satisfaction of patients and caregivers seeking information.",San H; Bayrakci O; Cagdas B; Serdengecti M; Alagoz E 39831699,A randomised cross-over trial assessing the impact of AI-generated individual feedback on written online assignments for medical students.,2025,Medical teacher,,,,"PURPOSE: Self-testing has been proven to significantly improve not only simple learning outcomes, but also higher-order skills such as clinical reasoning in medical students. Previous studies have shown that self-testing was especially beneficial when it was presented with feedback, which leaves the question whether an immediate and personalized feedback further encourages this effect. Therefore, we hypothesised that individual feedback has a greater effect on learning outcomes, compared to generic feedback. MATERIALS AND METHODS: In a randomised cross-over trial, German medical students were invited to voluntarily answer daily key-feature questions via an App. For half of the items they received a generalised feedback by an expert, while the feedback on the other half was generated immediately through ChatGPT. After the intervention, the students participated in a mandatory exit exam. RESULTS: Those participants who used the app more frequently experienced a better learning outcome compared to those who did not use it frequently, even though this finding was only examined in a correlative nature. The individual ChatGPT generated feedback did not show a greater effect on exit exam scores compared to the expert comment (51.8 +/- 22.0% vs. 55.8 +/- 22.8%; p = 0.06). CONCLUSION: This study proves the concept of providing personalised feedback on medical questions. Despite the promising results, improved prompting and further development of the application seems necessary to strengthen the possible impact of the personalised feedback. 
Our study closes a research gap and holds great potential for further use not only in medicine but also in other academic fields.",Nissen L; Rother JF; Heinemann M; Reimer LM; Jonas S; Raupach T 38304112,Exploring Capabilities of Large Language Models such as ChatGPT in Radiation Oncology.,2024,Advances in radiation oncology,,,,"PURPOSE: Technological progress of machine learning and natural language processing has led to the development of large language models (LLMs), capable of producing well-formed text responses and providing natural language access to knowledge. Modern conversational LLMs such as ChatGPT have shown remarkable capabilities across a variety of fields, including medicine. These models may assess even highly specialized medical knowledge within specific disciplines, such as radiation therapy. We conducted an exploratory study to examine the capabilities of ChatGPT to answer questions in radiation therapy. METHODS AND MATERIALS: A set of multiple-choice questions about clinical, physics, and biology general knowledge in radiation oncology as well as a set of open-ended questions were created. These were given as prompts to the LLM ChatGPT, and the answers were collected and analyzed. For the multiple-choice questions, it was checked how many of the answers of the model could be clearly assigned to one of the allowed multiple-choice-answers, and the proportion of correct answers was determined. For the open-ended questions, independent blinded radiation oncologists evaluated the quality of the answers regarding correctness and usefulness on a 5-point Likert scale. Furthermore, the evaluators were asked to provide suggestions for improving the quality of the answers. RESULTS: For 70 multiple-choice questions, ChatGPT gave valid answers in 66 cases (94.3%). In 60.61% of the valid answers, the selected answer was correct (50.0% of clinical questions, 78.6% of physics questions, and 58.3% of biology questions). 
For 25 open-ended questions, 12 answers of ChatGPT were considered as ""acceptable,"" ""good,"" or ""very good"" regarding both correctness and helpfulness by all 6 participating radiation oncologists. Overall, the answers were considered ""very good"" in 29.3% and 28%, ""good"" in 28% and 29.3%, ""acceptable"" in 19.3% and 19.3%, ""bad"" in 9.3% and 9.3%, and ""very bad"" in 14% and 14% regarding correctness/helpfulness. CONCLUSIONS: Modern conversational LLMs such as ChatGPT can provide satisfying answers to many relevant questions in radiation therapy. As they still fall short of consistently providing correct information, it is problematic to use them for obtaining medical information. As LLMs will further improve in the future, they are expected to have an increasing impact not only on general society, but also on clinical practice, including radiation oncology.",Dennstadt F; Hastings J; Putora PM; Vu E; Fischer GF; Suveg K; Glatzer M; Riggenbach E; Ha HL; Cihoric N 38717951,Using ChatGPT-4 to Analyze 24-Hour Urine Results and Generate Custom Dietary Recommendations for Nephrolithiasis.,2024,Journal of endourology,,,,"Purpose: The increasing incidence of nephrolithiasis underscores the need for effective, accessible tools to aid urologists in preventing recurrence. Despite dietary modification's crucial role in prevention, targeted dietary counseling using 24-hour urine collections is underutilized. This study evaluates ChatGPT-4, a multimodal large language model, in analyzing urine collection results and providing custom dietary advice, exploring the potential for artificial intelligence-assisted analysis and counseling. Materials and Methods: Eleven unique prompts with synthesized 24-hour urine collection results were submitted to ChatGPT-4. The model was instructed to provide five dietary recommendations in response to the results. One prompt contained all ""normal"" values, with subsequent prompts introducing one abnormality each. 
Generated responses were assessed for accuracy, completeness, and appropriateness by two urologists, a nephrologist, and a clinical dietitian. Results: ChatGPT-4 achieved average scores of 5.2/6 for accuracy, 2.4/3 for completeness, and 2.6/3 for appropriateness. It correctly identified all ""normal"" values but had difficulty consistently detecting abnormalities and formulating appropriate recommendations. The model performed particularly poorly in response to calcium and citrate abnormalities and failed to address 3/10 abnormalities entirely. Conclusions: ChatGPT-4 exhibits potential in the dietary management of nephrolithiasis but requires further refinement for dependable performance. The model demonstrated the ability to generate personalized recommendations that were often accurate and complete but displayed inconsistencies in identifying and addressing urine abnormalities. Despite these limitations, with precise prompt design, physician oversight, and continued training, ChatGPT-4 can serve as a foundation for personalized medicine while also reducing administrative burden, indicating its promising role in improving the management of conditions such as nephrolithiasis.",Kiriakedis S; Duty B; Chase T; Wusirika R; Metzler I 37776392,Evaluating ChatGPT responses in the context of a 53-year-old male with a femoral neck fracture: a qualitative analysis.,2024,European journal of orthopaedic surgery & traumatology : orthopedie traumatologie,,,,"PURPOSE: The integration of artificial intelligence (AI) tools, such as ChatGPT, in clinical medicine and medical education has gained significant attention due to their potential to support decision-making and improve patient care. However, there is a need to evaluate the benefits and limitations of these tools in specific clinical scenarios. METHODS: This study used a case study approach within the field of orthopaedic surgery. 
A clinical case report featuring a 53-year-old male with a femoral neck fracture was used as the basis for evaluation. ChatGPT, a large language model, was asked to respond to clinical questions related to the case. The responses generated by ChatGPT were evaluated qualitatively, considering their relevance, justification, and alignment with the responses of real clinicians. Alternative dialogue protocols were also employed to assess the impact of additional prompts and contextual information on ChatGPT responses. RESULTS: ChatGPT generally provided clinically appropriate responses to the questions posed in the clinical case report. However, the level of justification and explanation varied across the generated responses. Occasionally, clinically inappropriate responses and inconsistencies were observed in the generated responses across different dialogue protocols and on separate days. CONCLUSIONS: The findings of this study highlight both the potential and limitations of using ChatGPT in clinical practice. While ChatGPT demonstrated the ability to provide relevant clinical information, the lack of consistent justification and occasional clinically inappropriate responses raise concerns about its reliability. These results underscore the importance of careful consideration and validation when using AI tools in healthcare. Further research and clinician training are necessary to effectively integrate AI tools like ChatGPT, ensuring their safe and reliable use in clinical decision-making.",Zhou Y; Moon C; Szatkowski J; Moore D; Stevens J 38661379,Keeping Up With ChatGPT: Evaluating Its Recognition and Interpretation of Nuclear Medicine Images.,2024,Clinical nuclear medicine,,,,"PURPOSE: The latest iteration of GPT4 (generative pretrained transformer) is a large multimodal model that can integrate both text and image input, but its performance with medical images has not been systematically evaluated. 
We studied whether ChatGPT with GPT-4V(ision) can recognize images from common nuclear medicine examinations and interpret them. PATIENTS AND METHODS: Fifteen representative images (scintigraphy, 11; PET, 4) were submitted to ChatGPT with GPT-4V(ision), both in its Default and ""Advanced Data Analysis (beta)"" version. ChatGPT was asked to name the type of examination and tracer, explain the findings and whether there are abnormalities. ChatGPT should also mark anatomical structures or pathological findings. The appropriateness of the responses was rated by 3 nuclear medicine physicians. RESULTS: The Default version identified the examination and the tracer correctly in the majority of the 15 cases (60% or 53%) and gave an ""appropriate"" description of the findings or abnormalities in 47% or 33% of cases, respectively. The Default version cannot manipulate images. ""Advanced Data Analysis (beta)"" failed in all tasks in >90% of cases. A ""major"" or ""incompatible"" inconsistency between 3 trials of the same prompt was observed in 73% (Default version) or 87% of cases (""Advanced Data Analysis (beta)"" version). CONCLUSIONS: Although GPT-4V(ision) demonstrates preliminary capabilities in analyzing nuclear medicine images, it exhibits significant limitations, particularly in its reliability (ie, correctness, predictability, and consistency).",Rogasch JMM; Jochens HV; Metzger G; Wetz C; Kaufmann J; Furth C; Amthauer H; Schatka I 37790756,Benchmarking ChatGPT-4 on a radiation oncology in-training exam and Red Journal Gray Zone cases: potentials and challenges for ai-assisted medical education and decision making in radiation oncology.,2023,Frontiers in oncology,,,,"PURPOSE: The potential of large language models in medicine for education and decision-making purposes has been demonstrated as they have achieved decent scores on medical exams such as the United States Medical Licensing Exam (USMLE) and the MedQA exam. 
This work aims to evaluate the performance of ChatGPT-4 in the specialized field of radiation oncology. METHODS: The 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases are used to benchmark the performance of ChatGPT-4. The TXIT exam contains 300 questions covering various topics of radiation oncology. The 2022 Gray Zone collection contains 15 complex clinical cases. RESULTS: For the TXIT exam, ChatGPT-3.5 and ChatGPT-4 have achieved the scores of 62.05% and 78.77%, respectively, highlighting the advantage of the latest ChatGPT-4 model. Based on the TXIT exam, ChatGPT-4's strong and weak areas in radiation oncology are identified to some extent. Specifically, ChatGPT-4 demonstrates better knowledge of statistics, CNS & eye, pediatrics, biology, and physics than knowledge of bone & soft tissue and gynecology, as per the ACR knowledge domain. Regarding clinical care paths, ChatGPT-4 performs better in diagnosis, prognosis, and toxicity than brachytherapy and dosimetry. It lacks proficiency in in-depth details of clinical trials. For the Gray Zone cases, ChatGPT-4 is able to suggest a personalized treatment approach to each case with high correctness and comprehensiveness. Importantly, it provides novel treatment aspects for many cases, which are not suggested by any human experts. CONCLUSION: Both evaluations demonstrate the potential of ChatGPT-4 in medical education for the general public and cancer patients, as well as the potential to aid clinical decision-making, while acknowledging its limitations in certain domains. 
Owing to the risk of hallucinations, it is essential to verify the content generated by models such as ChatGPT for accuracy.",Huang Y; Gomaa A; Semrau S; Haderlein M; Lettmaier S; Weissmann T; Grigo J; Tkhayat HB; Frey B; Gaipl U; Distel L; Maier A; Fietkau R; Bert C; Putz F 38066714,Genetic counselors' utilization of ChatGPT in professional practice: A cross-sectional study.,2024,American journal of medical genetics. Part A,,,,"PURPOSE: The precision medicine era has seen increased utilization of artificial intelligence (AI) in the field of genetics. We sought to explore the ways that genetic counselors (GCs) currently use the publicly accessible AI tool Chat Generative Pre-trained Transformer (ChatGPT) in their work. METHODS: GCs in North America were surveyed about how ChatGPT is used in different aspects of their work. Descriptive statistics were reported through frequencies and means. RESULTS: Of 118 GCs who completed the survey, 33.8% (40) reported using ChatGPT in their work; 47.5% (19) use it in clinical practice, 35% (14) use it in education, and 32.5% (13) use it in research. Most GCs (62.7%; 74) felt that it saves time on administrative tasks but the majority (82.2%; 97) felt that a paramount challenge was the risk of obtaining incorrect information. The majority of GCs not using ChatGPT (58.9%; 46) felt it was not necessary for their work. CONCLUSION: A considerable number of GCs in the field are using ChatGPT in different ways, but it is primarily helpful with tasks that involve writing. 
It has the potential to streamline workflows in clinical genetics, but practitioners need to be informed and uniformly trained about its limitations.",Ahimaz P; Bergner AL; Florido ME; Harkavy N; Bhattacharyya S 40352999,Fruits of the Professional Educator Appreciation and Recognition (PEAR) Awards: Learning what Students Value in Their Medical Educators.,2025,Medical science educator,,,,"PURPOSE: The Professional Educator Appreciation and Recognition (PEAR) awards program was created by students at Baylor College of Medicine (BCM) in 2020 to recognize exemplary educators. We reviewed our 3-year experience of this initiative to identify characteristics of award-winning educators, and to assess the award's impact. MATERIALS AND METHODS: We reviewed the title, department, and clinical affiliation of award winners. Using ChatGPT, we qualitatively analyzed de-identified nomination narratives to identify themes of educator characteristics most valued by students. Following refinement by the study team, each award was then assigned one or more themes. Unsolicited thank-you emails were reviewed to consider the impact of the award upon recipients. RESULTS: Of 202 award recipients, most were near-peers (two-thirds were assistant professors or residents). Winners were affiliated with diverse departments and disciplines. Students most valued outstanding teaching skills (36.6%), showing support by being kind, encouraging, or approachable (26.7%), and investing time in trainees via constructive feedback (26.7%). Unsolicited thank-you emails from 20% of recipients conveyed the meaning of the award for the winning educators. CONCLUSIONS: A review of the BCM PEAR award program identified educator characteristics most valued by students, providing targets for professional development. This low-cost, high-impact initiative may enhance educator wellness and motivate professional behaviors. 
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s40670-024-02234-2.",Tomlinson M; Nasto K; Gosselin K; Friedman EM; Gill A; Rose S 39110155,Accuracy and Completeness of Large Language Models About Antibody-Drug Conjugates and Associated Ocular Adverse Effects.,2024,Cornea,,,,"PURPOSE: The purpose of this study was to assess the accuracy and completeness of 3 large language models (LLMs) to generate information about antibody-drug conjugate (ADC)-associated ocular toxicities. METHODS: There were 22 questions about ADCs, tisotumab vedotin, and mirvetuximab soravtansine that were developed and input into ChatGPT 4.0, Bard, and LLaMa. Answers were rated by 4 ocular toxicity experts using standardized 6-point Likert scales on accuracy and completeness. ANOVA tests were conducted for comparison between the 3 subgroups, followed by pairwise t-tests. Interrater variability was assessed with Fleiss kappa tests. RESULTS: The mean accuracy score was 4.62 (SD 0.89) for ChatGPT, 4.77 (SD 0.90) for Bard, and 4.41 (SD 1.09) for LLaMA. Both ChatGPT (P = 0.03) and Bard (P = 0.003) scored significantly better for accuracy when compared with LLaMA. The mean completeness score was 4.43 (SD 0.91) for ChatGPT, 4.57 (SD 0.93) for Bard, and 4.42 (SD 0.99) for LLaMA. There were no significant differences in completeness scores between groups. Fleiss kappa assessment for interrater variability was good (0.74) for accuracy and fair (0.31) for completeness. CONCLUSIONS: All 3 LLMs had relatively high accuracy and completeness ratings, showing LLMs are able to provide sufficient answers for niche topics of ophthalmology. Our results indicate that ChatGPT and Bard may be slightly better at providing more accurate answers than LLaMA. 
As further research and treatment plans are developed for ADC-associated ocular toxicities, these LLMs should be reassessed to see if they provide complete and accurate answers that remain in line with current medical knowledge.",Marshall R; Xu H; Dalvin LA; Mishra K; Edalat C; Kirupaharan N; Francis JH; Berkenstock M 38615289,"Geriatrics and artificial intelligence in Spain (Ger-IA project): talking to ChatGPT, a nationwide survey.",2024,European geriatric medicine,,,,"PURPOSE: The purposes of the study were to describe the degree of agreement of geriatricians with the answers given by an AI tool (ChatGPT) in response to questions related to different areas in geriatrics, to study the differences between specialists and residents in geriatrics in terms of the degree of agreement with ChatGPT, and to analyse the mean scores obtained by areas of knowledge/domains. METHODS: An observational study was conducted involving 126 doctors from 41 geriatric medicine departments in Spain. Ten questions about geriatric medicine were posed to ChatGPT, and doctors evaluated the AI's answers using a Likert scale. Sociodemographic variables were included. Questions were categorized into five knowledge domains, and means and standard deviations were calculated for each. RESULTS: 130 doctors answered the questionnaire. 126 doctors (69.8% women, mean age 41.4 [9.8]) were included in the final analysis. The mean score obtained by ChatGPT was 3.1/5 [0.67]. Specialists rated ChatGPT lower than residents (3.0/5 vs. 3.3/5 points, respectively, P < 0.05). By domains, ChatGPT scored better (M: 3.96; SD: 0.71) in general/theoretical questions rather than in complex decisions/end-of-life situations (M: 2.50; SD: 0.76), while answers related to diagnosis/performing of complementary tests obtained the lowest scores (M: 2.48; SD: 0.77). CONCLUSION: Scores showed considerable variability depending on the area of knowledge. 
Questions related to theoretical aspects of challenges/future in geriatrics obtained better scores. When it comes to complex decision-making, appropriateness of the therapeutic efforts or decisions about diagnostic tests, professionals indicated a poorer performance. AI is likely to be incorporated into some areas of medicine, but it still presents important limitations, mainly in complex medical decision-making.",Rossello-Jimenez D; Docampo S; Collado Y; Cuadra-Llopart L; Riba F; Llonch-Masriera M 38206515,"Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale.",2024,CJEM,,,,"PURPOSE: The release of the ChatGPT prototype to the public in November 2022 drastically reduced the barrier to using artificial intelligence by allowing easy access to a large language model with only a simple web interface. One situation where ChatGPT could be useful is in triaging patients arriving at the emergency department. This study aimed to address the research problem: ""can emergency physicians use ChatGPT to accurately triage patients using the Canadian Triage and Acuity Scale (CTAS)?"". METHODS: Six unique prompts were developed independently by five emergency physicians. An automated script was used to query ChatGPT with each of the 6 prompts combined with 61 validated and previously published patient vignettes. Thirty repetitions of each combination were performed for a total of 10,980 simulated triages. RESULTS: In 99.6% of 10,980 queries, a CTAS score was returned. However, there was considerable variation in results. Repeatability (use of the same prompt repeatedly) was responsible for 21.0% of overall variation. Reproducibility (use of different prompts) was responsible for 4.0% of overall variation. Overall accuracy of ChatGPT to triage simulated patients was 47.5% with a 13.7% under-triage rate and a 38.7% over-triage rate. 
More extensively detailed text given as a prompt was associated with greater reproducibility, but minimal increase in accuracy. CONCLUSIONS: This study suggests that the current ChatGPT large language model is not sufficient for emergency physicians to triage simulated patients using the Canadian Triage and Acuity Scale due to poor repeatability and accuracy. Medical practitioners should be aware that while ChatGPT can be a valuable tool, it may lack consistency and may frequently provide false information.",Franc JM; Cheng L; Hart A; Hata R; Hertelendy A 38217726,ChatGPT vs UpToDate: comparative study of usefulness and reliability of Chatbot in common clinical presentations of otorhinolaryngology-head and neck surgery.,2024,European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery,,,,"PURPOSE: The use of chatbots, a form of artificial intelligence, in medicine has been increasing in recent years. UpToDate(R) is another well-known search tool built on evidence-based knowledge and is used daily by doctors worldwide. In this study, we aimed to investigate the usefulness and reliability of ChatGPT compared to UpToDate in Otorhinolaryngology and Head and Neck Surgery (ORL-HNS). MATERIALS AND METHODS: ChatGPT-3.5 and UpToDate were interrogated for the management of 25 common clinical case scenarios (13 males/12 females) recruited from the literature, reflecting daily observation at the Department of Otorhinolaryngology of Ege University Faculty of Medicine. Scientific references for the management were requested for each clinical case. The accuracy of the references in the ChatGPT answers was assessed on a 0-2 scale and the usefulness of the ChatGPT and UpToDate answers was assessed with 1-3 scores by reviewers. UpToDate and ChatGPT 3.5 responses were compared. 
RESULTS: ChatGPT did not give references for some questions, in contrast to UpToDate. ChatGPT's information was limited to 2021. UpToDate supported the paper with subheadings, tables, figures, and algorithms. The mean accuracy score of references in ChatGPT answers was 0.25-weak/unrelated. The median (Q1-Q3) was 1.00 (1.25-2.00) for ChatGPT and 2.63 (2.75-3.00) for UpToDate; the difference was statistically significant (p < 0.001). UpToDate was observed to be more useful and reliable than ChatGPT. CONCLUSIONS: ChatGPT has the potential to support physicians in finding information, but our results suggest that ChatGPT needs to be improved to increase the usefulness and reliability of medical evidence-based knowledge.",Karimov Z; Allahverdiyev I; Agayarov OY; Demir D; Almuradova E 38218659,An evaluation of AI generated literature reviews in musculoskeletal radiology.,2024,The surgeon : journal of the Royal Colleges of Surgeons of Edinburgh and Ireland,,,,"PURPOSE: The use of artificial intelligence (AI) tools to aid in summarizing information in medicine and research has recently garnered a huge amount of interest. While tools such as ChatGPT produce convincing and natural-sounding output, the answers are sometimes incorrect. Some of these drawbacks, it is hoped, can be avoided by using programmes trained for a more specific scope. In this study we compared the performance of a new AI tool (the-literature.com) to the latest version of OpenAI's ChatGPT (GPT-4) in summarizing topics that the authors have significantly contributed to. METHODS: The AI tools were asked to produce a literature review on 7 topics. These were selected based on the research topics that the authors were intimately familiar with and have contributed to through their own publications. The output produced by the AI tools was graded on a 1-5 Likert scale for accuracy, comprehensiveness, and relevance by two fellowship trained consultant radiologists. 
RESULTS: The-literature.com produced 3 excellent summaries, 3 very poor summaries not relevant to the prompt, and one summary, which was relevant but did not include all relevant papers. All of the summaries produced by GPT-4 were relevant, but fewer relevant papers were identified. The average Likert rating was 2.88 for the-literature and 3.86 for GPT-4. There was good agreement between the ratings of both radiologists (ICC = 0.883). CONCLUSION: Summaries produced by AI in its current state require careful human validation. GPT-4 on average provides higher quality summaries. Neither tool can reliably identify all relevant publications.",Jenko N; Ariyaratne S; Jeys L; Evans S; Iyengar KP; Botchu R 38977032,Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study.,2024,Journal of educational evaluation for health professions,,,,"PURPOSE: This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) with respect to standardized urology multiple-choice items in the United States. METHODS: In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4, and responses were recorded. Items were categorized based on topic and question complexity (recall, interpretation, and problem-solving). The accuracy of GPT-3.5 and GPT-4 was compared across item types in February 2024. RESULTS: GPT-4 answered 44.4% of items correctly compared to 30.9% for GPT-3.5 (P<0.0001). GPT-4 (vs. GPT-3.5) had higher accuracy with urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items. Endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items did not show significant differences in performance across versions. GPT-4 also outperformed GPT-3.5 with respect to recall (45.9% vs. 
27.4%, P<0.00001), interpretation (45.6% vs. 31.5%, P=0.0005), and problem-solving (41.8% vs. 34.5%, P=0.56) type items. This difference was not significant for the higher-complexity items. CONCLUSIONS: ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, with GPT-4 outperforming GPT-3.5. The accuracy was below the proposed minimum passing standards for the American Board of Urology's Continuing Urologic Certification knowledge reinforcement activity (60%). As artificial intelligence progresses in complexity, ChatGPT may become more capable and accurate with respect to board examination items. For now, its responses should be scrutinized.",Yudovich MS; Makarova E; Hague CM; Raman JD 39349172,Evaluacion de la fiabilidad y legibilidad de las respuestas de los chatbots como recurso de informacion al paciente para las exploraciones PET-TC mas communes.,2025,Revista espanola de medicina nuclear e imagen molecular,,,,"PURPOSE: This study aimed to evaluate the reliability and readability of responses generated by two popular AI-chatbots, 'ChatGPT-4.0' and 'Google Gemini', to potential patient questions about PET/CT scans. MATERIALS AND METHODS: Thirty potential questions for each of [(18)F]FDG and [(68)Ga]Ga-DOTA-SSTR PET/CT, and twenty-nine potential questions for [(68)Ga]Ga-PSMA PET/CT were asked separately to ChatGPT-4 and Gemini in May 2024. The responses were evaluated for reliability and readability using the modified DISCERN (mDISCERN) scale, Flesch Reading Ease (FRE), Gunning Fog Index (GFI), and Flesch-Kincaid Reading Grade Level (FKRGL). The inter-rater reliability of mDISCERN scores provided by three raters (ChatGPT-4, Gemini, and a nuclear medicine physician) for the responses was assessed. 
RESULTS: The median [min-max] mDISCERN scores reviewed by the physician for responses about FDG, PSMA and DOTA PET/CT scans were 3.5 [2-4], 3 [3-4], 3 [3-4] for ChatGPT-4 and 4 [2-5], 4 [2-5], 3.5 [3-5] for Gemini, respectively. The mDISCERN scores assessed using ChatGPT-4 for answers about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3.5 [3-5], 3 [3-4], 3 [2-3] for ChatGPT-4, and 4 [3-5], 4 [3-5], 4 [3-5] for Gemini, respectively. The mDISCERN scores evaluated using Gemini for responses about FDG, PSMA, and DOTA-SSTR PET/CTs were 3 [2-4], 2 [2-4], 3 [2-4] for ChatGPT-4, and 3 [2-5], 3 [1-5], 3 [2-5] for Gemini, respectively. The inter-rater reliability correlation coefficients of mDISCERN scores for ChatGPT-4 responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 0.629 (95% CI = 0.32-0.812), 0.707 (95% CI = 0.458-0.853) and 0.738 (95% CI = 0.519-0.866), respectively (p < 0.001). The correlation coefficients of mDISCERN scores for Gemini responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 0.824 (95% CI = 0.677-0.910), 0.881 (95% CI = 0.78-0.94) and 0.847 (95% CI = 0.719-0.922), respectively (p < 0.001). The mDISCERN scores assessed by ChatGPT-4, Gemini, and the physician showed that the chatbots' responses about all PET/CT scans had moderate to good statistical agreement according to the inter-rater reliability correlation coefficient (p < 0.001). There was a statistically significant difference in all readability scores (FKRGL, GFI, and FRE) of ChatGPT-4 and Gemini responses about PET/CT scans (p < 0.001). Gemini responses were shorter and had better readability scores than ChatGPT-4 responses. CONCLUSION: There was an acceptable level of agreement between raters for the mDISCERN score, indicating agreement with the overall reliability of the responses. 
However, the information provided by AI-chatbots cannot be easily read by the public.",Aydinbelge-Dizdar N; Dizdar K 37166289,ChatGPT and Lacrimal Drainage Disorders: Performance and Scope of Improvement.,2023,Ophthalmic plastic and reconstructive surgery,,,,"PURPOSE: This study aimed to report the performance of the large language model ChatGPT (OpenAI, San Francisco, CA, U.S.A.) in the context of lacrimal drainage disorders. METHODS: A set of prompts was constructed through questions and statements spanning common and uncommon aspects of lacrimal drainage disorders. Care was taken to avoid constructing prompts that had significant or new knowledge beyond the year 2020. Each of the prompts was presented thrice to ChatGPT. The questions covered common disorders such as primary acquired nasolacrimal duct obstruction and congenital nasolacrimal duct obstruction and their cause and management. The prompts also tested ChatGPT on certain specifics, such as the history of dacryocystorhinostomy (DCR) surgery, lacrimal pump anatomy, and human canalicular surfactants. ChatGPT was also quizzed on controversial topics such as silicone intubation and the use of mitomycin C in DCR surgery. The responses of ChatGPT were carefully analyzed for evidence-based content, specificity of the response, presence of generic text, disclaimers, factual inaccuracies, and its abilities to admit mistakes and challenge incorrect premises. Three lacrimal surgeons graded the responses into three categories: correct, partially correct, and factually incorrect. RESULTS: A total of 21 prompts were presented to ChatGPT. The responses were detailed and structured according to the prompt. In response to most questions, ChatGPT provided a generic disclaimer that it could not give medical advice or professional opinion but then provided an answer to the question in detail. 
Specific prompts such as ""how can I perform an external DCR?"" were answered with a sequential listing of all the surgical steps. However, several factual inaccuracies were noted across many ChatGPT replies. Several responses on controversial topics such as silicone intubation and mitomycin C were generic and not precisely evidence-based. ChatGPT's response to specific questions such as canalicular surfactants and idiopathic canalicular inflammatory disease was poor. The presentation of variable prompts on a single topic led to responses with either repetition or recycling of phrases. Citations were uniformly missing across all responses. Agreement among the three observers was high (95%) in grading the responses. The responses of ChatGPT were graded as correct for only 40% of the prompts, partially correct in 35%, and outright factually incorrect in 25%. Hence, some degree of factual inaccuracy was present in 60% of the responses, if we consider the partially correct responses. The exciting aspect was that ChatGPT was able to admit mistakes and correct them when presented with counterarguments. It was also capable of challenging incorrect prompts and premises. CONCLUSION: The performance of ChatGPT in the context of lacrimal drainage disorders, at best, can be termed average. However, the potential of this AI chatbot to influence medicine is enormous. There is a need for it to be specifically trained and retrained for individual medical subspecialties.",Ali MJ 38188345,The opportunities and challenges of adopting ChatGPT in medical research.,2023,Frontiers in medicine,,,,"PURPOSE: This study aims to investigate the opportunities and challenges of adopting ChatGPT in medical research. METHODS: A qualitative approach with focus groups is adopted in this study. A total of 62 participants, including academic researchers from different streams in medicine and eHealth, participated in this study. 
RESULTS: A total of five themes with 16 sub-themes related to the opportunities and five themes with 12 sub-themes related to the challenges were identified. The major opportunities include improved data collection and analysis, improved communication and accessibility, and support for researchers in multiple streams of medical research. The major challenges identified were limitations of training data leading to bias, ethical issues, technical limitations, and limitations in data collection and analysis. CONCLUSION: Although ChatGPT can be used as a potential tool in medical research, there is a need for further evidence to generalize its impact on the different research activities.",Alsadhan A; Al-Anezi F; Almohanna A; Alnaim N; Alzahrani H; Shinawi R; AboAlsamh H; Bakhshwain A; Alenazy M; Arif W; Alyousef S; Alhamidi S; Alghamdi A; AlShrayfi N; Rubaian NB; Alanzi T; AlSahli A; Alturki R; Herzallah N 37980605,Chat GPT for the management of obstructive sleep apnea: do we have a polar star?,2024,European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery,,,,"PURPOSE: This study explores the potential of the Chat-Generative Pre-Trained Transformer (Chat-GPT), a Large Language Model (LLM), in assisting healthcare professionals in the diagnosis of obstructive sleep apnea (OSA). It aims to assess the agreement between Chat-GPT's responses and those of expert otolaryngologists, shedding light on the role of AI-generated content in medical decision-making. METHODS: A prospective, cross-sectional study was conducted, involving 350 otolaryngologists from 25 countries who responded to a specialized OSA survey. Chat-GPT was tasked with providing answers to the same survey questions. Responses were assessed by both super-experts and statistically analyzed for agreement. 
RESULTS: The study revealed that Chat-GPT and expert responses shared a common answer in over 75% of cases for individual questions. However, the overall consensus was achieved in only four questions. Super-expert assessments showed a moderate agreement level, with Chat-GPT scoring slightly lower than experts. Statistically, Chat-GPT's responses differed significantly from experts' opinions (p = 0.0009). Sub-analysis revealed areas of improvement for Chat-GPT, particularly in questions where super-experts rated its responses lower than expert consensus. CONCLUSIONS: Chat-GPT demonstrates potential as a valuable resource for OSA diagnosis, especially where access to specialists is limited. The study emphasizes the importance of AI-human collaboration, with Chat-GPT serving as a complementary tool rather than a replacement for medical professionals. This research contributes to the discourse in otolaryngology and encourages further exploration of AI-driven healthcare applications. While Chat-GPT exhibits a commendable level of consensus with expert responses, ongoing refinements in AI-based healthcare tools hold significant promise for the future of medicine, addressing the underdiagnosis and undertreatment of OSA and improving patient outcomes.",Mira FA; Favier V; Dos Santos Sobreira Nunes H; de Castro JV; Carsuzaa F; Meccariello G; Vicini C; De Vito A; Lechien JR; Chiesa-Estomba C; Maniaci A; Iannella G; Rojas EP; Cornejo JB; Cammaroto G 39613920,Diagnostic performance of ChatGPT in tibial plateau fracture in knee X-ray.,2025,Emergency radiology,,,,"PURPOSE: Tibial plateau fractures are relatively common and require accurate diagnosis. Chat Generative Pre-Trained Transformer (ChatGPT) has emerged as a tool to improve medical diagnosis. This study aims to investigate the accuracy of this tool in diagnosing tibial plateau fractures. 
METHODS: A secondary analysis was performed on 111 knee radiographs from emergency department patients, with 29 fractures confirmed by computed tomography (CT) imaging. The X-rays were reviewed by a board-certified emergency physician (EP) and radiologist and then analyzed by ChatGPT-4 and ChatGPT-4o. The diagnostic performances were compared using the area under the receiver operating characteristic curve (AUC). Sensitivity, specificity, and likelihood ratios were also calculated. RESULTS: The results indicated a sensitivity and negative likelihood ratio of 58.6% (95% CI: 38.9 - 76.4%) and 0.4 (95% CI: 0.3-0.7) for the EP, 72.4% (95% CI: 52.7 - 87.2%) and 0.3 (95% CI: 0.2-0.6) for the radiologist, 27.5% (95% CI: 12.7 - 47.2%) and 0.7 (95% CI: 0.6-0.9) for ChatGPT-4, and 55.1% (95% CI: 35.6 - 73.5%) and 0.4 (95% CI: 0.3-0.7) for ChatGPT-4o. The specificity and positive likelihood ratio were 85.3% (95% CI: 75.8 - 92.2%) and 4.0 (95% CI: 2.1-7.3) for the EP, 76.8% (95% CI: 66.2 - 85.4%) and 3.1 (95% CI: 1.9-4.9) for the radiologist, 95.1% (95% CI: 87.9 - 98.6%) and 5.6 (95% CI: 1.8-17.3) for ChatGPT-4, and 93.9% (95% CI: 86.3 - 97.9%) and 9.0 (95% CI: 3.6-22.4) for ChatGPT-4o. The area under the receiver operating characteristic curve (AUC) was 0.72 (95% CI: 0.6-0.8) for the EP, 0.75 (95% CI: 0.6-0.8) for the radiologist, 0.61 (95% CI: 0.4-0.7) for ChatGPT-4, and 0.74 (95% CI: 0.6-0.8) for ChatGPT-4o. The EP and radiologist significantly outperformed ChatGPT-4 (P value = 0.02 and 0.01, respectively), whereas there was no significant difference between the EP, ChatGPT-4o, and radiologist. CONCLUSION: ChatGPT-4o matched the physicians' performance and also had the highest specificity. 
Similar to the physicians, ChatGPT chatbots were not suitable for ruling out the fracture.",Mohammadi M; Parviz S; Parvaz P; Pirmoradi MM; Afzalimoghaddam M; Mirfazaelian H 40321662,Simulation-Based Evaluation of Large Language Models for Comorbidity Detection in Sleep Medicine - a Pilot Study on ChatGPT o1 Preview.,2025,Nature and science of sleep,,,,"PURPOSE: Timely identification of comorbidities is critical in sleep medicine, where large language models (LLMs) like ChatGPT are currently emerging as transformative tools. Here, we investigate whether the novel LLM ChatGPT o1 preview can identify individual health risks or potentially existing comorbidities from the medical data of fictitious sleep medicine patients. METHODS: We conducted a simulation-based study using 30 fictitious patients, designed to represent realistic variations in demographic and clinical parameters commonly seen in sleep medicine. Each profile included personal data (eg, body mass index, smoking status, drinking habits), blood pressure, and routine blood test results, along with a predefined sleep medicine diagnosis. Each patient profile was evaluated independently by the LLM and a sleep medicine specialist (SMS) for identification of potential comorbidities or individual health risks. Their recommendations were compared for concordance across lifestyle changes and further medical measures. RESULTS: The LLM achieved high concordance with the SMS for lifestyle modification recommendations, including 100% concordance on smoking cessation (kappa = 1; p < 0.001), 97% on alcohol reduction (kappa = 0.92; p < 0.001) and endocrinological examination (kappa = 0.92; p < 0.001) or 93% on weight loss (kappa = 0.86; p < 0.001). However, it exhibited a tendency to over-recommend further medical measures (particularly 57% concordance for cardiological examination (kappa = 0.08; p = 0.28) and 33% for gastrointestinal examination (kappa = 0.1; p = 0.22)) compared to the SMS. 
CONCLUSION: Despite the obvious limitation of using fictitious data, the findings suggest that LLMs like ChatGPT have the potential to complement clinical workflows in sleep medicine by identifying individual health risks and comorbidities. As LLMs continue to evolve, their integration into healthcare could redefine the approach to patient evaluation and risk stratification. Future research should contextualize the findings within broader clinical applications, ideally testing locally run LLMs that meet data protection requirements.",Seifen C; Bahr-Hamm K; Gouveris H; Pordzik J; Blaikie A; Matthias C; Kuhn S; Buhr CR 39313138,Artificial Intelligence Large Language Models Address Anterior Cruciate Ligament Reconstruction: Superior Clarity and Completeness by Gemini Compared With ChatGPT-4 in Response to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines.,2024,Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association,,,,"PURPOSE: To assess the ability of ChatGPT-4 and Gemini to generate accurate and relevant responses to the 2022 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPG) for anterior cruciate ligament reconstruction (ACLR). METHODS: Responses from ChatGPT-4 and Gemini to prompts derived from all 15 AAOS guidelines were evaluated by 7 fellowship-trained orthopaedic sports medicine surgeons using a structured questionnaire assessing 5 key characteristics on a scale from 1 to 5. The prompts were categorized into 3 areas: diagnosis and preoperative management, surgical timing and technique, and rehabilitation and prevention. Statistical analysis included mean scoring, standard deviation, and 2-sided t tests to compare the performance between the 2 large language models (LLMs). Scores were then evaluated for inter-rater reliability (IRR).
RESULTS: Overall, both LLMs performed well with mean scores >4 for the 5 key characteristics. Gemini demonstrated superior performance in overall clarity (4.848 +/- 0.36 vs 4.743 +/- 0.481, P = .034), but all other characteristics demonstrated nonsignificant differences (P > .05). Gemini also demonstrated superior clarity in the surgical timing and technique (P = .038) as well as the prevention and rehabilitation (P = .044) subcategories. Additionally, Gemini had superior completeness scores in the rehabilitation and prevention subcategory (P = .044), but no statistically significant differences were found amongst the other subcategories. The overall IRR was found to be 0.71 (moderate). CONCLUSIONS: Both Gemini and ChatGPT-4 demonstrate an overall good ability to generate accurate and relevant responses to question prompts based on the 2022 AAOS CPG for ACLR. However, Gemini demonstrated superior clarity in multiple domains in addition to superior completeness for questions pertaining to rehabilitation and prevention. CLINICAL RELEVANCE: The current study addresses a gap in the LLM and ACLR literature by comparing the performance of ChatGPT-4 to Gemini, which is growing in popularity with more than 300 million individual uses in May 2024 alone. Moreover, the results demonstrated superior performance of Gemini in both clarity and completeness, which are critical elements of a tool being used by patients for educational purposes. Additionally, the current study uses question prompts based on the AAOS CPG, which may be used as a method of standardization for future investigations on performance of LLM platforms. 
Thus, the results of this study may be of interest to both the readership of Arthroscopy and patients.",Quinn M; Milner JD; Schmitt P; Morrissey P; Lemme N; Marcaccio S; DeFroda S; Tabaddor R; Owens BD 38936557,ChatGPT-4 Performs Clinical Information Retrieval Tasks Using Consistently More Trustworthy Resources Than Does Google Search for Queries Concerning the Latarjet Procedure.,2025,Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association,,,,"PURPOSE: To assess the ability of ChatGPT-4, an automated Chatbot powered by artificial intelligence, to answer common patient questions concerning the Latarjet procedure for patients with anterior shoulder instability and compare this performance with Google Search Engine. METHODS: Using previously validated methods, a Google search was first performed using the query ""Latarjet."" Subsequently, the top 10 frequently asked questions (FAQs) and associated sources were extracted. ChatGPT-4 was then prompted to provide the top 10 FAQs and answers concerning the procedure. This process was repeated to identify additional FAQs requiring discrete-numeric answers to allow for a comparison between ChatGPT-4 and Google. Discrete, numeric answers were subsequently assessed for accuracy on the basis of the clinical judgment of 2 fellowship-trained sports medicine surgeons who were blinded to search platform. RESULTS: Mean (+/- standard deviation) accuracy to numeric-based answers was 2.9 +/- 0.9 for ChatGPT-4 versus 2.5 +/- 1.4 for Google (P = .65). ChatGPT-4 derived information for answers only from academic sources, which was significantly different from Google Search Engine (P = .003), which used only 30% academic sources and websites from individual surgeons (50%) and larger medical practices (20%). For general FAQs, 40% of FAQs were found to be identical when comparing ChatGPT-4 and Google Search Engine. 
In terms of sources used to answer these questions, ChatGPT-4 again used 100% academic resources, whereas Google Search Engine used 60% academic resources, 20% surgeon personal websites, and 20% medical practices (P = .087). CONCLUSIONS: ChatGPT-4 demonstrated the ability to provide accurate and reliable information about the Latarjet procedure in response to patient queries, using multiple academic sources in all cases. This was in contrast to Google Search Engine, which more frequently used single-surgeon and large medical practice websites. Despite differences in the resources accessed to perform information retrieval tasks, the clinical relevance and accuracy of information provided did not significantly differ between ChatGPT-4 and Google Search Engine. CLINICAL RELEVANCE: Commercially available large language models (LLMs), such as ChatGPT-4, can perform diverse information retrieval tasks on-demand. An important medical information retrieval application for LLMs consists of the ability to provide comprehensive, relevant, and accurate information for various use cases such as investigation about a recently diagnosed medical condition or procedure. Understanding the performance and abilities of LLMs for use cases has important implications for deployment within health care settings.",Oeding JF; Lu AZ; Mazzucco M; Fu MC; Taylor SA; Dines DM; Warren RF; Gulotta LV; Dines JS; Kunze KN 39242057,ChatGPT Can Offer At Least Satisfactory Responses to Common Patient Questions Regarding Hip Arthroscopy.,2024,Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association,,,,"PURPOSE: To assess the accuracy of answers provided by ChatGPT 4.0 (an advanced language model developed by OpenAI) regarding 25 common patient questions about hip arthroscopy. 
METHODS: ChatGPT 4.0 was presented with 25 common patient questions regarding hip arthroscopy with no follow-up questions or repetition. Each response was evaluated by 2 board-certified orthopaedic sports medicine surgeons independently. Responses were rated, with scores of 1, 2, 3, and 4 corresponding to ""excellent response not requiring clarification,"" ""satisfactory requiring minimal clarification,"" ""satisfactory requiring moderate clarification,"" and ""unsatisfactory requiring substantial clarification,"" respectively. RESULTS: Twenty responses were rated ""excellent"" and 2 responses were rated ""satisfactory requiring minimal clarification"" by both reviewers. Responses to questions ""What kind of anesthesia is used for hip arthroscopy?"" and ""What is the average age for hip arthroscopy?"" were rated as ""satisfactory requiring minimal clarification"" by both reviewers. None of the responses were rated as ""satisfactory requiring moderate clarification"" or ""unsatisfactory"" by either of the reviewers. CONCLUSIONS: ChatGPT 4.0 provides at least satisfactory responses to patient questions regarding hip arthroscopy. Under the supervision of an orthopaedic sports medicine surgeon, it could be used as a supplementary tool for patient education. CLINICAL RELEVANCE: This study compared the answers of ChatGPT to patients' questions regarding hip arthroscopy with the current literature. 
As ChatGPT has gained popularity among patients, the study aimed to find if the responses that patients get from this chatbot are compatible with the up-to-date literature.",Ozbek EA; Ertan MB; Kindan P; Karaca MO; Gursoy S; Chahla J 38117307,Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard.,2024,European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery,,,,"PURPOSE: To conduct a comparative performance evaluation of GPT-3.5, GPT-4 and Google Bard in self-assessment questions at the level of the American Sleep Medicine Certification Board Exam. METHODS: A total of 301 text-based single-best-answer multiple choice questions with four answer options each, across 10 categories, were included in the study and transcribed as inputs for GPT-3.5, GPT-4 and Google Bard. The first output responses generated were selected and matched for answer accuracy against the gold-standard answer provided by the American Academy of Sleep Medicine for each question. A global score of 80% and above is required by human sleep medicine specialists to pass each exam category. RESULTS: GPT-4 successfully achieved the pass mark of 80% or above in five of the 10 exam categories, including the Normal Sleep and Variants Self-Assessment Exam (2021), Circadian Rhythm Sleep-Wake Disorders Self-Assessment Exam (2021), Insomnia Self-Assessment Exam (2022), Parasomnias Self-Assessment Exam (2022) and the Sleep-Related Movements Self-Assessment Exam (2023). GPT-4 demonstrated superior performance in all exam categories and achieved a higher overall score of 68.1% when compared against both GPT-3.5 (46.8%) and Google Bard (45.5%), which was statistically significant (p value < 0.001). 
There was no significant difference in the overall score performance between GPT-3.5 and Google Bard. CONCLUSIONS: Otolaryngologists and sleep medicine physicians have a crucial role through agile and robust research to ensure the next generation AI chatbots are built safely and responsibly.",Cheong RCT; Pang KP; Unadkat S; Mcneillis V; Williamson A; Joseph J; Randhawa P; Andrews P; Paleri V 40209833,Large Language Model Use Cases in Health Care Research Are Redundant and Often Lack Appropriate Methodological Conduct: A Scoping Review and Call for Improved Practices.,2025,Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association,,,,"PURPOSE: To describe the current use cases of large language models (LLMs) in musculoskeletal medicine and to evaluate the methodologic conduct of these investigations in order to safeguard future implementation of LLMs in clinical research and identify key areas for methodological improvement. METHODS: A comprehensive literature search was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines using PubMed, Cochrane Library, and Embase databases to identify eligible studies. Included studies evaluated the use of LLMs within any realm of orthopaedic surgery, regardless of its application in a clinical or educational setting. Methodological Index for Non-Randomized Studies criteria was used to assess the quality of all included studies. RESULTS: In total, 114 studies published from 2022 to 2024 were identified. 
Extensive use case redundancy was observed, and 5 main categories of clinical applications of LLMs were identified: 48 studies (42.1%) that assessed the ability to answer patient questions, 24 studies (21.1%) that evaluated the ability to diagnose and manage medical conditions, 21 studies (18.4%) that evaluated the ability to take orthopaedic examinations, 11 studies (9.6%) that analyzed the ability to develop or evaluate patient educational materials, and 10 studies (8.8%) concerning other applications, such as generating images, generating discharge documents and clinical letters, writing scientific abstracts and manuscripts, and enhancing billing efficiency. General orthopaedics was the focus of most included studies (n = 39, 34.2%), followed by orthopaedic sports medicine (n = 18, 15.8%), and adult reconstructive surgery (n = 17, 14.9%). ChatGPT 3.5 was the most common LLM used or evaluated (n = 79, 69.2%), followed by ChatGPT 4.0 (n = 47, 41.2%). Methodological inconsistency was prevalent among studies, with 36 (31.6%) studies failing to disclose the exact prompts used, 64 (56.1%) failing to disclose the exact outputs generated by the LLM, and only 7 (6.1%) evaluating different prompting strategies to elicit desired outputs. No studies attempted to investigate how race or gender influenced model outputs. CONCLUSIONS: Among studies evaluating LLM health care use cases, the scope of clinical investigations was limited, with most studies showing redundant use cases. Because of infrequently reported descriptions of prompting strategies, incomplete model specifications, failure to disclose exact model outputs, and limited attempts to address bias, methodological inconsistency was concerningly extensive. CLINICAL RELEVANCE: A comprehensive understanding of current LLM use cases is critical to familiarize providers with the possibilities through which this technology may be used in clinical practice. 
As LLM health care applications transition from research to clinical integration, model transparency and trustworthiness is critical. The results of the current study suggest that guidance is urgently needed, with focus on promoting appropriate methodological conduct practices and novel use cases to advance the field.",Kunze KN; Gerhold C; Dave U; Abunnur N; Mamonov A; Nwachukwu BU; Verma NN; Chahla J 40015549,"An alternative approach to code, store, and regenerate 3D data in dental medicine using open-source software: A scripting-based technique.",2025,Journal of dentistry,,,,"PURPOSE: To develop a scripting-based technique for managing three-dimensional (3D) dental data and evaluate the regenerated standard tessellation language (STL) data in terms of file size, accuracy (trueness and precision), and processing time. MATERIALS AND METHODS: Ten STL dental and maxillofacial models were obtained from various imaging technologies, including intraoral scanners, computer-aided design (CAD) software, and cone-beam computed tomography (CBCT), and saved as STL files. ChatGPT was used to generate Python scripts in Blender for mesh simplification and data compression, which were then saved as .py files. The models were regenerated from these scripts in Blender, and their accuracy was assessed using GOM Inspect software, comparing trueness and precision. Statistical analysis, including Kruskal-Wallis and Mann-Whitney tests, was conducted to evaluate differences in file sizes between the original, Python-generated, and regenerated STL files, with statistical analyses performed at a level of significance alpha=0.05. RESULTS: The scripting-based technique was successfully utilized in ChatGPT to generate Python script code for accessing comprehensive data on STL models, utilizing Blender's scripting functionality. 
This approach enabled the generation, regeneration, and visualization of STL models, resulting in significantly smaller file sizes for both the Python script and regenerated STL files compared to the original STL files (p < 0.001). No significant differences in trueness were observed, with deviations ranging from 0.0 µm to 6.8 µm, and all regenerated STL models demonstrated perfect precision. Additionally, a proportional relationship was noted between the original STL file sizes and processing times. CONCLUSIONS: The scripting-based approach proved to be effective in coding, storing, and regenerating STL dental data with reduced file sizes and efficient processing times without compromising the accuracy. CLINICAL SIGNIFICANCE: Various STL dental models of patients can be coded, stored, and regenerated to be used again within efficient processing time without affecting the accuracy.",Elbashti ME; Paz-Cortes MM; Giovannini G; Acero-Sanz J; Abou-Ayash S; Cakmak G; Molinero-Mourelle P 40357425,Performance analysis of an emergency triage system in ophthalmology using a customized CHATBOT.,2025,Digital health,,,,"PURPOSE: To evaluate the performance of a custom ChatGPT-based chatbot in triaging ophthalmic emergencies compared to trained ophthalmologists. METHODS: One hundred hypothetical ophthalmic cases were created based on actual patient data from an ophthalmic emergency department, including details such as age, symptoms and medical history. Three experienced ophthalmologists independently graded these cases using a four-tier severity scale, ranging from Grade 1 (immediate care required) to Grade 4 (non-urgent care). A customized version of ChatGPT was developed to perform the same grading task. Inter-rater agreement was measured between the chatbot and the ophthalmologists, as well as among all human graders. 
RESULTS: The chatbot demonstrated substantial agreement with the ophthalmologists, achieving Cohen's kappa scores of 0.737, 0.749 and 0.751, respectively. The highest agreement was between ophthalmologist 3 and the chatbot (kappa = 0.751). Fleiss' kappa for overall agreement among all graders was 0.79, indicating substantial agreement. The Kruskal-Wallis test showed no statistically significant differences in the distribution of grades assigned by the chatbot and the ophthalmologists (p = 0.967). Bootstrap analysis revealed no significant difference in kappa values between the chatbot and human graders (p = 0.572, 95% CI -0.163 to 0.072). CONCLUSIONS: The study demonstrates that a customized chatbot can perform ophthalmic triage with a level of accuracy comparable to that of trained ophthalmologists. This suggests that AI-assisted triage could be a valuable tool in emergency departments, potentially enhancing clinical workflows and reducing waiting times while maintaining high standards of patient care.",Schumacher I; Ferro Desideri L; Buhler VMM; Sagurski N; Subhi Y; Bhardwaj G; Roth J; Anguita R 37536678,ChatGPT for Sample-Size Calculation in Sports Medicine and Exercise Sciences: A Cautionary Note.,2023,International journal of sports physiology and performance,,,,"PURPOSE: To investigate the accuracy of ChatGPT (Chat generative pretrained transformer), a large language model, in calculating sample size for sport-sciences and sports-medicine research studies. METHODS: We conducted an analysis on 4 published papers (ie, examples 1-4) encompassing various study designs and approaches for calculating sample size in 3 sport-science and -medicine journals, including 3 randomized controlled trials and 1 survey paper. We provided ChatGPT with all necessary data such as mean, percentage SD, normal deviates (Zalpha/2 and Z1-beta), and study design. Prompting from 1 example has subsequently been reused to gain insights into the reproducibility of the ChatGPT response. 
RESULTS: ChatGPT correctly calculated the sample size for 1 randomized controlled trial but failed in the remaining 3 examples, including the incorrect identification of the formula in one example of a survey paper. After interaction with ChatGPT, the correct sample size was obtained for the survey paper. Intriguingly, when the prompt from Example 3 was reused, ChatGPT provided a completely different sample size than its initial response. CONCLUSIONS: While the use of artificial-intelligence tools holds great promise, it should be noted that it might lead to errors and inconsistencies in sample-size calculations even when the tool is fed with the necessary correct information. As artificial-intelligence technology continues to advance and learn from human feedback, there is hope for improvement in sample-size calculation and other research tasks. However, it is important for scientists to exercise caution in utilizing these tools. Future studies should assess more advanced/powerful versions of this tool (ie, ChatGPT4).",Methnani J; Latiri I; Dergaa I; Chamari K; Ben Saad H 37553552,Exploring the potential of ChatGPT as a supplementary tool for providing orthopaedic information.,2023,"Knee surgery, sports traumatology, arthroscopy : official journal of the ESSKA",,,,"PURPOSE: To investigate the potential use of large language models (LLMs) in orthopaedics by presenting queries pertinent to anterior cruciate ligament (ACL) surgery to generative pre-trained transformer (ChatGPT, specifically using its GPT-4 model of March 14th 2023). Additionally, this study aimed to evaluate the depth of the LLM's knowledge and investigate its adaptability to different user groups. It was hypothesized that the ChatGPT would be able to adapt to different target groups due to its strong language understanding and processing capabilities. 
METHODS: ChatGPT was presented with 20 questions and a response was requested for two distinct target audiences: patients and non-orthopaedic medical doctors. Two board-certified orthopaedic sports medicine surgeons and two expert orthopaedic sports medicine surgeons independently evaluated the responses generated by ChatGPT. Mean correctness, completeness, and adaptability to the target audiences (patients and non-orthopaedic medical doctors) were determined. A three-point response scale facilitated nuanced assessment. RESULTS: ChatGPT exhibited fair accuracy, with average correctness scores of 1.69 and 1.66 (on a scale from 0, incorrect, 1, partially correct, to 2, correct) for patients and medical doctors, respectively. Three of the 20 questions (15.0%) were deemed incorrect by any of the four orthopaedic sports medicine surgeon assessors. Moreover, overall completeness was calculated to be 1.51 and 1.64 for patients and medical doctors, respectively, while overall adaptiveness was determined to be 1.75 and 1.73 for patients and doctors, respectively. CONCLUSION: Overall, ChatGPT was successful in generating correct responses in approximately 65% of the cases related to ACL surgery. The findings of this study imply that LLMs offer potential as a supplementary tool for acquiring orthopaedic knowledge. However, although ChatGPT can provide guidance and effectively adapt to diverse target audiences, it cannot supplant the expertise of orthopaedic sports medicine surgeons in diagnostic and treatment planning endeavours due to its limited understanding of orthopaedic domains and its potential for erroneous responses. 
LEVEL OF EVIDENCE: V.",Kaarre J; Feldt R; Keeling LE; Dadoo S; Zsidai B; Hughes JD; Samuelsson K; Musahl V 37917165,Artificial intelligence chatbots as sources of patient education material for obstructive sleep apnoea: ChatGPT versus Google Bard.,2024,European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery,,,,"PURPOSE: To perform the first head-to-head comparative evaluation of patient education material for obstructive sleep apnoea generated by two artificial intelligence chatbots, ChatGPT and its primary rival Google Bard. METHODS: Fifty frequently asked questions on obstructive sleep apnoea in English were extracted from the patient information webpages of four major sleep organizations and categorized as input prompts. ChatGPT and Google Bard responses were selected and independently rated using the Patient Education Materials Assessment Tool-Printable (PEMAT-P) Auto-Scoring Form by two otolaryngologists, with a Fellowship of the Royal College of Surgeons (FRCS) and a special interest in sleep medicine and surgery. Responses were subjectively screened for any incorrect or dangerous information as a secondary outcome. The Flesch-Kincaid Calculator was used to evaluate the readability of responses for both ChatGPT and Google Bard. RESULTS: A total of 46 questions were curated and categorized into three domains: condition (n = 14), investigation (n = 9) and treatment (n = 23). Understandability scores for ChatGPT versus Google Bard on the various domains were as follows: condition 90.86% vs.76.32% (p < 0.001); investigation 89.94% vs. 71.67% (p < 0.001); treatment 90.78% vs.73.74% (p < 0.001). Actionability scores for ChatGPT versus Google Bard on the various domains were as follows: condition 77.14% vs. 51.43% (p < 0.001); investigation 72.22% vs. 54.44% (p = 0.05); treatment 73.04% vs. 
54.78% (p = 0.002). The mean Flesch-Kincaid Grade Level for ChatGPT was 9.0 and Google Bard was 5.9. No incorrect or dangerous information was identified in any of the generated responses from both ChatGPT and Google Bard. CONCLUSION: Evaluation of ChatGPT and Google Bard patient education material for OSA indicates the former to offer superior information across several domains.",Cheong RCT; Unadkat S; Mcneillis V; Williamson A; Joseph J; Randhawa P; Andrews P; Paleri V 38925234,The Large Language Model ChatGPT-4 Exhibits Excellent Triage Capabilities and Diagnostic Performance for Patients Presenting With Various Causes of Knee Pain.,2025,Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association,,,,"PURPOSE: To provide a proof-of-concept analysis of the appropriateness and performance of ChatGPT-4 to triage, synthesize differential diagnoses, and generate treatment plans concerning common presentations of knee pain. METHODS: Twenty knee complaints warranting triage and expanded scenarios were input into ChatGPT-4, with memory cleared prior to each new input to mitigate bias. For the 10 triage complaints, ChatGPT-4 was asked to generate a differential diagnosis that was graded for accuracy and suitability in comparison to a differential created by 2 orthopaedic sports medicine physicians. For the 10 clinical scenarios, ChatGPT-4 was prompted to provide treatment guidance for the patient, which was again graded. To test the higher-order capabilities of ChatGPT-4, further inquiry into these specific management recommendations was performed and graded. RESULTS: All ChatGPT-4 diagnoses were deemed appropriate within the spectrum of potential pathologies on a differential. 
The top diagnosis on the differential was identical between surgeons and ChatGPT-4 for 70% of scenarios, and the top diagnosis provided by the surgeon appeared as either the first or second diagnosis in 90% of scenarios. Overall, 16 of 30 diagnoses (53.3%) in the differential were identical. When provided with 10 expanded vignettes with a single diagnosis, the accuracy of ChatGPT-4 increased to 100%, with the suitability of management graded as appropriate in 90% of cases. Specific information pertaining to conservative management, surgical approaches, and related treatments was appropriate and accurate in 100% of cases. CONCLUSIONS: ChatGPT-4 provided clinically reasonable diagnoses to triage patient complaints of knee pain due to various underlying conditions that were generally consistent with differentials provided by sports medicine physicians. Diagnostic performance was enhanced when providing additional information, allowing ChatGPT-4 to reach high predictive accuracy for recommendations concerning management and treatment options. However, ChatGPT-4 may show clinically important error rates for diagnosis depending on prompting strategy and information provided; therefore, further refinements are necessary prior to implementation into clinical workflows. CLINICAL RELEVANCE: Although ChatGPT-4 is increasingly being used by patients for health information, the potential for ChatGPT-4 to serve as a clinical support tool is unclear. 
In this study, we found that ChatGPT-4 was frequently able to diagnose and triage knee complaints appropriately as rated by sports medicine surgeons, suggesting that it may eventually be a useful clinical support tool.",Kunze KN; Varady NH; Mazzucco M; Lu AZ; Chahla J; Martin RK; Ranawat AS; Pearle AD; Williams RJ 3rd 39521391,Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Noncustom Models for Evidence-Based Medicine.,2025,Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association,,,,"PURPOSE: To show the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information using an anterior cruciate ligament (ACL) injury case. METHODS: A set of 100 questions and answers based on the 2022 AAOS ACL guidelines were curated. Closed-source (OpenAI GPT4/GPT 3.5 and Anthropic's Claude3) and open-source models (LLama3 8b/70b and Mistral 8x7b) were asked questions in base form and again with AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with artificial intelligence (AI) agents and reevaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. Recall-Oriented Understudy for Gisting Evaluation and Metric for Evaluation of Translation with Explicit Ordering scores were calculated to assess semantic similarity in the response. RESULTS: All noncustom LLM models started below 60% accuracy. Applying RAG improved the accuracy of every model by an average 39.7%. The highest performing model with just RAG was Meta's open-source Llama3 70b (94%). The highest performing model with RAG and AI agents was OpenAI's GPT4 (95%). 
CONCLUSIONS: RAG improved accuracy by an average of 39.7%, with the highest accuracy rate of 94% in the Meta Llama3 70b. Incorporating AI agents into a previously RAG-augmented LLM improved ChatGPT4 accuracy rate to 95%. Thus, agentic- and RAG-augmented LLMs can be accurate liaisons of information, supporting our hypothesis. CLINICAL RELEVANCE: Despite literature surrounding the use of LLM in medicine, there has been considerable and appropriate skepticism given the variably accurate response rates. This study establishes the groundwork to identify whether custom modifications to LLMs using RAG and agentic augmentation can better deliver accurate information in orthopaedic care. With this knowledge, online medical information commonly sought in popular LLMs, such as ChatGPT, can be standardized and provide relevant online medical information to better support shared decision making between surgeon and patient.",Woo JJ; Yang AJ; Olsen RJ; Hasan SS; Nawabi DH; Nwachukwu BU; Williams RJ 3rd; Ramkumar PN 38686158,Ethical and Professional Decision-Making Capabilities of Artificial Intelligence Chatbots: Evaluating ChatGPT's Professional Competencies in Medicine.,2024,Medical science educator,,,,"PURPOSE: We examined the performance of artificial intelligence chatbots on the PREview Practice Exam, an online situational judgment test for professionalism and ethics. METHODS: We used validated methodologies to calculate scores and descriptive statistics, chi-squared tests, and Fisher's exact tests to compare scores by model and competency. RESULTS: GPT-3.5 and GPT-4 scored 6/9 (76th percentile) and 7/9 (92nd percentile), respectively, higher than medical school applicant averages of 5/9 (56th percentile). Both models answered 95+% of questions correctly. 
CONCLUSIONS: Chatbots outperformed the average applicant on PREview, suggesting their potential for healthcare training and decision-making and highlighting risks of online assessment delivery.",Lin JC; Kurapati SS; Younessi DN; Scott IU; Gong DA 38365990,A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases.,2024,European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery,,,,"PURPOSE: With recent advances in artificial intelligence (AI), it has become crucial to thoroughly evaluate its applicability in healthcare. This study aimed to assess the accuracy of ChatGPT in diagnosing ear, nose, and throat (ENT) pathology, and comparing its performance to that of medical experts. METHODS: We conducted a cross-sectional comparative study where 32 ENT cases were presented to ChatGPT 3.5, ENT physicians, ENT residents, family medicine (FM) specialists, second-year medical students (Med2), and third-year medical students (Med3). Each participant provided three differential diagnoses. The study analyzed diagnostic accuracy rates and inter-rater agreement within and between participant groups and ChatGPT. RESULTS: The accuracy rate of ChatGPT was 70.8%, being not significantly different from ENT physicians or ENT residents. However, a significant difference in correctness rate existed between ChatGPT and FM specialists (49.8%, p < 0.001), and between ChatGPT and medical students (Med2 47.5%, p < 0.001; Med3 47%, p < 0.001). Inter-rater agreement for the differential diagnosis between ChatGPT and each participant group was either poor or fair. In 68.75% of cases, ChatGPT failed to mention the most critical diagnosis. 
CONCLUSIONS: ChatGPT demonstrated accuracy comparable to that of ENT physicians and ENT residents in diagnosing ENT pathology, outperforming FM specialists, Med2 and Med3. However, it showed limitations in identifying the most critical diagnosis.",Makhoul M; Melkane AE; Khoury PE; Hadi CE; Matar N 38760650,Evaluation of the Impact of ChatGPT on the Selection of Surgical Technique in Bariatric Surgery.,2025,Obesity surgery,,,,"PURPOSE: With the growing interest in artificial intelligence (AI) applications in medicine, this study explores ChatGPT's potential to influence surgical technique selection in metabolic and bariatric surgery (MBS), contrasting AI recommendations with established clinical guidelines and expert consensus. MATERIALS AND METHODS: Conducting a single-center retrospective analysis, the study involved 161 patients who underwent MBS between January 2022 and December 2023. ChatGPT4 was used to analyze patient data, including demographics, pathological history, and BMI, to recommend the most suitable surgical technique. These AI recommendations were then compared with the hospital's algorithm-based decisions. RESULTS: ChatGPT recommended Roux-en-Y gastric bypass in over half of the cases. However, a significant difference was observed between AI suggestions and actual surgical techniques applied, with only a 34.16% match rate. Further analysis revealed no significant correlation between ChatGPT recommendations and the established surgical algorithm. CONCLUSION: Despite ChatGPT's ability to process and analyze large datasets, its recommendations for MBS techniques do not align closely with those determined by expert surgical teams using a high success rate algorithm. 
Consequently, the study concludes that ChatGPT4 should not replace expert consultation in selecting MBS techniques.",Lopez-Gonzalez R; Sanchez-Cordero S; Pujol-Gebelli J; Castellvi J 38527823,"Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts.",2024,Annals of family medicine,,,,"PURPOSE: Worldwide clinical knowledge is expanding rapidly, but physicians have sparse time to review scientific literature. Large language models (eg, Chat Generative Pretrained Transformer [ChatGPT]), might help summarize and prioritize research articles to review. However, large language models sometimes ""hallucinate"" incorrect information. METHODS: We evaluated ChatGPT's ability to summarize 140 peer-reviewed abstracts from 14 journals. Physicians rated the quality, accuracy, and bias of the ChatGPT summaries. We also compared human ratings of relevance to various areas of medicine to ChatGPT relevance ratings. RESULTS: ChatGPT produced summaries that were 70% shorter (mean abstract length of 2,438 characters decreased to 739 characters). Summaries were nevertheless rated as high quality (median score 90, interquartile range [IQR] 87.0-92.5; scale 0-100), high accuracy (median 92.5, IQR 89.0-95.0), and low bias (median 0, IQR 0-7.5). Serious inaccuracies and hallucinations were uncommon. Classification of the relevance of entire journals to various fields of medicine closely mirrored physician classifications (nonlinear standard error of the regression [SER] 8.6 on a scale of 0-100). However, relevance classification for individual articles was much more modest (SER 22.3). CONCLUSIONS: Summaries generated by ChatGPT were 70% shorter than mean abstract length and were characterized by high quality, high accuracy, and low bias. Conversely, ChatGPT had modest ability to classify the relevance of articles to medical specialties. 
We suggest that ChatGPT can help family physicians accelerate review of the scientific literature and have developed software (pyJournalWatch) to support this application. Life-critical medical decisions should remain based on full, critical, and thoughtful evaluation of the full text of research articles in context with clinical guidelines.",Hake J; Crowley M; Coy A; Shanks D; Eoff A; Kirmer-Voss K; Dhanda G; Parente DJ 40064613,Unveiling the power of R: a comprehensive perspective for laboratory medicine data analysis.,2025,Clinical chemistry and laboratory medicine,,,,"R language has gained traction in laboratory medicine for its statistical power and dynamic tools like RMarkdown and RShiny. However, there is limited literature summarizing R packages and functions tailored for laboratory medicine, making it difficult for clinical laboratory workers to access these tools. Additionally, varying algorithms across R packages can lead to inconsistencies in published reports. This review addresses these challenges by providing an overview of R's evolution and its key features, followed by a summary of statistical methods implemented in R, including platform comparisons, precision verification, factor analysis, and the establishment of reference intervals (RIs). We also highlight the development and validation of predictive models using techniques such as linear and logistic regression, decision trees, random forests, support vector machines, naive Bayes, K-Nearest Neighbors, k-means clustering, and backpropagation neural networks - all implemented in R. To ensure transparency and reproducibility in research, a checklist is provided for authors publishing papers using R for data analysis in laboratory medicine. In the final section, the potential of R in big data analytics is explored, focusing on standardized reporting through RMarkdown and the creation of user-friendly data visualization platforms with RShiny. 
Moreover, the integration of large language models (LLMs), such as ChatGPT, is discussed for their benefits in enhancing R programming, automating reporting, and offering insights from data analysis, thus improving the efficiency and accuracy of laboratory data analysis.",Ma C; Qiu L 37758604,Radiology Reading Room for the Future: Harnessing the Power of Large Language Models Like ChatGPT.,2023,Current problems in diagnostic radiology,,,,"Radiology has usually been the field of medicine at the forefront of technological advances, often being the first to wholeheartedly embrace them. From digitization to cloud-based architecture, radiology has led the way in adopting the latest advances. With the advent of large language models (LLMs), especially with the unprecedented explosion of the freely available ChatGPT, the time is ripe for radiology and radiologists to find novel ways to use the technology to improve their workflow. Towards this, we believe these LLMs have a key role in the radiology reading room, not only to expedite processes and simplify mundane and archaic tasks, but also to increase the radiologist's and radiologist trainee's knowledge base at a far faster pace. In this article, we discuss some of the ways we believe ChatGPT and the like can be harnessed in the reading room.",Tippareddy C; Jiang S; Bera K; Ramaiya N 38160089,Could ChatGPT Pass the UK Radiology Fellowship Examinations?,2024,Academic radiology,,,,"RATIONALE AND OBJECTIVES: Chat Generative Pre-trained Transformer (ChatGPT) is an artificial intelligence (AI) tool which utilises machine learning to generate original text resembling human language. AI models have recently demonstrated remarkable ability at analysing and solving problems, including passing professional examinations. We investigate the performance of ChatGPT on some of the UK radiology fellowship equivalent examination questions. 
METHODS: ChatGPT was asked to answer questions from question banks resembling the Fellowship of the Royal College of Radiologists (FRCR) examination. The entire physics part 1 question bank (203 5-part true/false questions) was answered by the GPT-4 model and answers recorded. 240 single best answer questions (SBAs) (representing the true length of the FRCR 2A examination) were answered by both GPT-3.5 and GPT-4 models. RESULTS: ChatGPT 4 answered 74.8% of part 1 true/false statements correctly. The spring 2023 passing mark of the part 1 examination was 75.5% and ChatGPT thus narrowly failed. In the 2A examination, ChatGPT 3.5 answered 50.8% SBAs correctly, while GPT-4 answered 74.2% correctly. The winter 2022 2A pass mark was 63.3% and thus GPT-4 clearly passed. CONCLUSION: AI models such as ChatGPT are able to answer the majority of questions in an FRCR style examination. It is reasonable to assume that further developments in AI will be more likely to succeed in comprehending and solving questions related to medicine, specifically clinical radiology. ADVANCES IN KNOWLEDGE: Our findings outline the unprecedented capabilities of AI, adding to the current relatively small body of literature on the subject, which in turn can play a role in medical training, evaluation and practice. This can undoubtedly have implications for radiology.",Ariyaratne S; Jenko N; Mark Davies A; Iyengar KP; Botchu R 37492829,"The Emerging Role of Generative Artificial Intelligence in Medical Education, Research, and Practice.",2023,Cureus,,,,"Recent breakthroughs in generative artificial intelligence (GAI) and the emergence of transformer-based large language models such as Chat Generative Pre-trained Transformer (ChatGPT) have the potential to transform healthcare education, research, and clinical practice. This article examines the current trends in using GAI models in medicine, outlining their strengths and limitations. 
It is imperative to develop further consensus-based guidelines to govern the appropriate use of GAI, not only in medical education but also in research, scholarship, and clinical practice.",Shoja MM; Van de Ridder JMM; Rajput V 37407364,Pearls and pitfalls of ChatGPT in medical oncology.,2023,Trends in cancer,,,,"Recently, ChatGPT has drawn attention to the potential uses of artificial intelligence (AI) in academia. Here, we discuss how ChatGPT can be of value to medicine and medical oncology and the potential pitfalls that may be encountered.",Blum J; Menta AK; Zhao X; Yang VB; Gouda MA; Subbiah V 40055532,Red teaming ChatGPT in medicine to yield real-world insights on model behavior.,2025,NPJ digital medicine,,,,"Red teaming, the practice of adversarially exposing unexpected or undesired model behaviors, is critical towards improving equity and accuracy of large language models, but non-model creator-affiliated red teaming is scant in healthcare. We convened teams of clinicians, medical and engineering students, and technical professionals (80 participants total) to stress-test models with real-world clinical cases and categorize inappropriate responses along axes of safety, privacy, hallucinations/accuracy, and bias. Six medically-trained reviewers re-analyzed prompt-response pairs and added qualitative annotations. Of 376 unique prompts (1504 responses), 20.1% were inappropriate (GPT-3.5: 25.8%; GPT-4.0: 16%; GPT-4.0 with Internet: 17.8%). Subsequently, we show the utility of our benchmark by testing GPT-4o, a model released after our event (20.4% inappropriate). 21.5% of responses appropriate with GPT-3.5 were inappropriate in updated models. 
We share insights for constructing red teaming prompts, and present our benchmark for iterative model assessments.",Chang CT; Farah H; Gui H; Rezaei SJ; Bou-Khalil C; Park YJ; Swaminathan A; Omiye JA; Kolluri A; Chaurasia A; Lozano A; Heiman A; Jia AS; Kaushal A; Jia A; Iacovelli A; Yang A; Salles A; Singhal A; Narasimhan B; Belai B; Jacobson BH; Li B; Poe CH; Sanghera C; Zheng C; Messer C; Kettud DV; Pandya D; Kaur D; Hla D; Dindoust D; Moehrle D; Ross D; Chou E; Lin E; Haredasht FN; Cheng G; Gao I; Chang J; Silberg J; Fries JA; Xu J; Jamison J; Tamaresis JS; Chen JH; Lazaro J; Banda JM; Lee JJ; Matthys KE; Steffner KR; Tian L; Pegolotti L; Srinivasan M; Manimaran M; Schwede M; Zhang M; Nguyen M; Fathzadeh M; Zhao Q; Bajra R; Khurana R; Azam R; Bartlett R; Truong ST; Fleming SL; Raj S; Behr S; Onyeka S; Muppidi S; Bandali T; Eulalio TY; Chen W; Zhou X; Ding Y; Cui Y; Tan Y; Liu Y; Shah N; Daneshjou R 38570021,A framework enabling LLMs into regulatory environment for transparency and trustworthiness and its application to drug labeling document.,2024,Regulatory toxicology and pharmacology : RTP,,,,"Regulatory agencies consistently deal with extensive document reviews, ranging from product submissions to both internal and external communications. Large Language Models (LLMs) like ChatGPT can be invaluable tools for these tasks, however present several challenges, particularly the proprietary information, combining customized function with specific review needs, and transparency and explainability of the model's output. Hence, a localized and customized solution is imperative. To tackle these challenges, we formulated a framework named askFDALabel on FDA drug labeling documents that is a crucial resource in the FDA drug review process. AskFDALabel operates within a secure IT environment and comprises two key modules: a semantic search and a Q&A/text-generation module. 
Module S is built on word embeddings to enable comprehensive semantic queries within labeling documents. Module T utilizes a tuned LLM to generate responses based on references from Module S. As a result, our framework enabled small LLMs to perform comparably to ChatGPT as a computationally inexpensive solution for regulatory application. To conclude, through AskFDALabel, we have showcased a pathway that harnesses LLMs to support agency operations within a secure environment, offering tailored functions for the needs of regulatory research.",Wu L; Xu J; Thakkar S; Gray M; Qu Y; Li D; Tong W 37434733,ChatGPT's potential role in non-English-speaking outpatient clinic settings.,2023,Digital health,,,,"Researchers recently utilized ChatGPT as a tool for composing clinic letters, highlighting its ability to generate accurate and empathetic communications. Here we demonstrated the potential application of ChatGPT as a medical assistant in Mandarin Chinese-speaking outpatient clinics, aiming to improve patient satisfaction in high-patient volume settings. ChatGPT achieved an average score of 72.4% in the Chinese Medical Licensing Examination's Clinical Knowledge section, ranking within the top 20th percentile. It also demonstrated its potential for clinical communication in non-English speaking environments. Our study suggests that ChatGPT could serve as an interface between physicians and patients in Chinese-speaking outpatient settings, possibly extending to other languages. However, further optimization is required, including training on medical-specific datasets, rigorous testing, privacy compliance, integration with existing systems, user-friendly interfaces, and the development of guidelines for medical professionals. Controlled clinical trials and regulatory approval are necessary before widespread implementation. 
As chatbots' integration into medical practice becomes more feasible, rigorous early investigations and pilot studies can help mitigate potential risks.",Zhu Z; Ying Y; Zhu J; Wu H 37802491,ChatGPT in medical research: challenging time ahead.,2023,The Medico-legal journal,,,,"Since its launch, ChatGPT, an artificial intelligence-powered language model tool, has generated significant attention in research writing. The use of ChatGPT in medical research can be a double-edged sword. ChatGPT can expedite the research writing process by assisting with hypothesis formulation, literature review, data analysis and manuscript writing. On the other hand, using ChatGPT raises concerns regarding the originality and authenticity of content, the precision and potential bias of the tool's output, and the potential legal issues associated with privacy, confidentiality and plagiarism. The article also calls for adherence to stringent citation guidelines and the development of regulations promoting the responsible application of AI. Despite the revolutionary capabilities of ChatGPT, the article highlights its inability to replicate human thought and the difficulties in maintaining the integrity and reliability of ChatGPT-enabled research, particularly in complex fields such as medicine and law. AI tools can be used as supplementary aids rather than primary sources of analysis in medical research writing.",Bhargava DC; Jadav D; Meshram VP; Kanchan T 37984563,Accuracy of ChatGPT in Common Gastrointestinal Diseases: Impact for Patients and Providers.,2024,Clinical gastroenterology and hepatology : the official clinical practice journal of the American Gastroenterological Association,,,,"Since its release in 2022, Chat Generative Pre-Trained Transformer (ChatGPT) became the most rapidly expanding consumer software application in history,(1) and its role in medicine is underscored by its potential to enhance patient education and physician-patient communication. 
Previous studies in gastroenterology and hepatology have focused primarily on the earlier Generative Pre-Trained Transformer 3 (GPT-3) model, with none investigating ChatGPT's ability to generate supportive references for its responses, or its applicability as a physician educational tool.(2-6) Our study evaluated the accuracy of the more recent ChatGPT, powered by GPT-4, in addressing frequently asked questions by patients on irritable bowel syndrome (IBS), inflammatory bowel disease (IBD), colonoscopy and colorectal cancer (CRC) screening, questions on CRC screening from a physician perspective, and reference generation and suitability.",Kerbage A; Kassab J; El Dahdah J; Burke CA; Achkar JP; Rouphael C 38144348,A systematic review and meta-analysis on ChatGPT and its utilization in medical and dental research.,2023,Heliyon,,,,"Since its release, ChatGPT has taken the world by storm with its utilization in various fields of life. This review's main goal was to offer a thorough and fact-based evaluation of ChatGPT's potential as a tool for medical and dental research, which could direct subsequent research and influence clinical practices. METHODS: Different online databases were scoured for relevant articles that were in accordance with the study objectives. A team of reviewers was assembled to devise a proper methodological framework for inclusion of articles and meta-analysis. RESULTS: 11 descriptive studies were considered for this review that evaluated the accuracy of ChatGPT in answering medical queries related to different domains such as systematic reviews, cancer, liver diseases, diagnostic imaging, education, and COVID-19 vaccination. The studies reported different accuracy ranges, from 18.3 % to 100 %, across various datasets and specialties. 
The meta-analysis showed an odds ratio (OR) of 2.25 and a relative risk (RR) of 1.47 with a 95 % confidence interval (CI), indicating that the accuracy of ChatGPT in providing correct responses was significantly higher compared to the total responses for queries. However, significant heterogeneity was present among the studies, suggesting considerable variability in the effect sizes across the included studies. CONCLUSION: The observations indicate that ChatGPT has the ability to provide appropriate solutions to questions in the medical and dentistry areas, but researchers and doctors should cautiously assess its responses because they might not always be dependable. Overall, the importance of this study rests in shedding light on ChatGPT's accuracy in the medical and dentistry fields and emphasizing the need for additional investigation to enhance its performance.",Bagde H; Dhopte A; Alam MK; Basri R 39039954,ChatGPT and dermatology.,2024,Italian journal of dermatology and venereology,,,,"Since the development of artificial intelligence (AI), several applications have been proposed. Among these, the intersection of AI and medicine has sparked a wave of innovation, revolutionizing various aspects of healthcare delivery, diagnosis, and treatment. A review of the current literature was performed to evaluate the possible applications of ChatGPT (OpenAI, San Francisco, CA, USA) in the dermatological field. A total of 20 manuscripts were collected in the present review, reporting several potential applications of ChatGPT in dermatology, ranging from clinical practice to patients' support. The convergence of ChatGPT and dermatology represents a compelling synergy that transcends traditional boundaries of healthcare delivery. 
As AI continues to evolve and permeate every facet of medicine, ChatGPT stands at the forefront of innovation, empowering patients and clinicians alike with its conversational prowess and knowledge dissemination capabilities.",D'Agostino M; Feo F; Martora F; Genco L; Megna M; Cacciapuoti S; Villani A; Potestio L 40259576,Can ChatGPT detect breast cancer on mammography?,2025,Journal of medical screening,,,,"Some noteworthy studies have questioned the use of ChatGPT, a free artificial intelligence program that has become very popular and widespread in recent times, in different branches of medicine. In this study, the success of ChatGPT in detecting breast cancer on mammography (MMG) was evaluated. The pre-treatment mammographic images of patients with a histopathological diagnosis of invasive breast carcinoma and prominent mass formation on MMG were read separately into two ChatGPT subprograms: Radiologist Report Writer (P1) and XrayGPT (P2). The programs were asked to determine mammographic breast density, tumor size, side, and quadrant, the presence of microcalcification, distortion, skin or nipple changes, and axillary lymphadenopathy (LAP), and BI-RADS score. The responses were evaluated in consensus by two experienced radiologists. Although the mass detection rate of both programs was over 60%, the success in determining breast density, tumor size and localization, microcalcification, distortion, skin or nipple changes, and axillary LAP was low. BI-RADS category agreement with readers was fair for P1 (kappa:28%, 0.20< kappa 30% TBSA is termed TEN, with an increased mortality rate of 25% to 35%. We present a case and management of TEN that involved >30% TBSA in a critically ill African American woman. Identification of the offending agent was difficult due to complicated medication exposure throughout her multi-facility care management. 
This case conveys the importance of close monitoring of a critically ill patient during a clinical course involving SJS-/TEN-inducing drugs. We also discuss the potential increased risks for SJS/TEN in the African American population due to genetic or epigenetic predispositions to skin conditions. This case report also contributes to increasing skin of color representation in the current literature. Additionally, we discuss the use of Chat Generative Pre-trained Transformer (ChatGPT, OpenAI LP, OpenAI Inc., San Francisco, CA, USA) and list its benefits and errors.",Lantz R 37248403,"Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing.",2023,La Radiologia medica,,,,Structured reporting may improve the radiological workflow and communication among physicians. Artificial intelligence applications in medicine are growing fast. Large language models (LLMs) are recently gaining importance as valuable tools in radiology and are currently being tested for the critical task of structured reporting. We compared four LLMs models in terms of knowledge on structured reporting and templates proposal. LLMs hold a great potential for generating structured reports in radiology but additional formal validations are needed on this topic.,Mallio CA; Sertorio AC; Bernetti C; Beomonte Zobel B 39928039,Appropriateness and Consistency of an Online Artificial Intelligence System's Response to Common Questions Regarding Cervical Fusion.,2025,Clinical spine surgery,,,,"STUDY DESIGN: Prospective survey study. OBJECTIVE: To address a gap that exists concerning ChatGPT's ability to respond to various types of questions regarding cervical surgery. SUMMARY OF BACKGROUND DATA: Artificial Intelligence (AI) and machine learning have been creating great change in the landscape of scientific research. 
Chat Generative Pre-trained Transformer (ChatGPT), an online AI language model, has emerged as a powerful tool in clinical medicine and surgery. Previous studies have demonstrated appropriate and reliable responses from ChatGPT concerning patient questions regarding total joint arthroplasty, distal radius fractures, and lumbar laminectomy. However, a gap exists in examining how accurate and reliable ChatGPT responses are to common questions related to cervical surgery. MATERIALS AND METHODS: Twenty questions regarding cervical surgery were presented to the online ChatGPT-3.5 web application 3 separate times, creating 60 responses. Responses were then analyzed by 3 fellowship-trained spine surgeons across 2 institutions using a modified Global Quality Scale (1-5 rating) to evaluate accuracy and utility. Descriptive statistics were reported based on responses, and intraclass correlation coefficients were then calculated to assess the consistency of response quality. RESULTS: Out of all questions proposed to the AI platform, the average score was 3.17 (95% CI, 2.92, 3.42), with 66.7% of responses being recorded to be of at least ""moderate"" quality by 1 reviewer. Nine (45%) questions yielded responses that were graded at least ""moderate"" quality by all 3 reviewers. The test-retest reliability was poor with the intraclass correlation coefficient (ICC) calculated as 0.0941 (-0.222, 0.135). CONCLUSION: This study demonstrated that ChatGPT can answer common patient questions concerning cervical surgery with moderate quality during the majority of responses. Further research within AI is necessary to improve response quality and consistency.",Miller M; DiCiurcio WT 3rd; Meade M; Buchan L; Gleimer J; Woods B; Kepler C 38531823,Chat Generative Pretraining Transformer Answers Patient-focused Questions in Cervical Spine Surgery.,2024,Clinical spine surgery,,,,"STUDY DESIGN: Review of Chat Generative Pretraining Transformer (ChatGPT) outputs to select patient-focused questions. 
OBJECTIVE: We aimed to examine the quality of ChatGPT responses to cervical spine questions. BACKGROUND: Artificial intelligence and its utilization to improve patient experience across medicine is seeing remarkable growth. One such usage is patient education. For the first time on a large scale, patients can ask targeted questions and receive similarly targeted answers. Although patients may use these resources to assist in decision-making, there still exists little data regarding their accuracy, especially within orthopedic surgery and more specifically spine surgery. METHODS: We compiled 9 frequently asked questions cervical spine surgeons receive in the clinic to test ChatGPT's version 3.5 ability to answer a nuanced topic. Responses were reviewed by 2 independent reviewers on a Likert Scale for the accuracy of information presented (0-5 points), appropriateness in giving a specific answer (0-3 points), and readability for a layperson (0-2 points). Readability was assessed through the Flesh-Kincaid grade level analysis for the original prompt and for a second prompt asking for rephrasing at the sixth-grade reading level. RESULTS: On average, ChatGPT's responses scored a 7.1/10. Accuracy was rated on average a 4.1/5. Appropriateness was 1.8/3. Readability was a 1.2/2. Readability was determined to be at the 13.5 grade level originally and at the 11.2 grade level after prompting. CONCLUSIONS: ChatGPT has the capacity to be a powerful means for patients to gain important and specific information regarding their pathologies and surgical options. These responses are limited in their accuracy, and we, in addition, noted readability is not optimal for the average patient. 
Despite these limitations in ChatGPT's capability to answer these nuanced questions, the technology is impressive, and surgeons should be aware patients will likely increasingly rely on it.",Subramanian T; Araghi K; Amen TB; Kaidi A; Sosa B; Shahi P; Qureshi S; Iyer S 38483426,Harnessing the Power of Generative AI for Clinical Summaries: Perspectives From Emergency Physicians.,2024,Annals of emergency medicine,,,,"STUDY OBJECTIVE: The workload of clinical documentation contributes to health care costs and professional burnout. The advent of generative artificial intelligence language models presents a promising solution. The perspective of clinicians may contribute to effective and responsible implementation of such tools. This study sought to evaluate 3 uses for generative artificial intelligence for clinical documentation in pediatric emergency medicine, measuring time savings, effort reduction, and physician attitudes and identifying potential risks and barriers. METHODS: This mixed-methods study was performed with 10 pediatric emergency medicine attending physicians from a single pediatric emergency department. Participants were asked to write a supervisory note for 4 clinical scenarios, with varying levels of complexity, twice without any assistance and twice with the assistance of ChatGPT Version 4.0. Participants evaluated 2 additional ChatGPT-generated clinical summaries: a structured handoff and a visit summary for a family written at an 8th grade reading level. Finally, a semistructured interview was performed to assess physicians' perspective on the use of ChatGPT in pediatric emergency medicine. Main outcomes and measures included between subjects' comparisons of the effort and time taken to complete the supervisory note with and without ChatGPT assistance. Effort was measured using a self-reported Likert scale of 0 to 10. 
Physicians' scoring of and attitude toward the ChatGPT-generated summaries were measured using a 0 to 10 Likert scale and open-ended questions. Summaries were scored for completeness, accuracy, efficiency, readability, and overall satisfaction. A thematic analysis was performed to analyze the content of the open-ended questions and to identify key themes. RESULTS: ChatGPT yielded a 40% reduction in time and a 33% decrease in effort for supervisory notes in intricate cases, with no discernible effect on simpler notes. ChatGPT-generated summaries for structured handoffs and family letters were highly rated, ranging from 7.0 to 9.0 out of 10, and most participants favored their inclusion in clinical practice. However, there were several critical reservations, out of which a set of general recommendations for applying ChatGPT to clinical summaries was formulated. CONCLUSION: Pediatric emergency medicine attendings in our study perceived that ChatGPT can deliver high-quality summaries while saving time and effort in many scenarios, but not all.",Barak-Corren Y; Wolf R; Rozenblum R; Creedon JK; Lipsett SC; Lyons TW; Michelson KA; Miller KA; Shapiro DJ; Reis BY; Fine AM 40265240,Artificial intelligence in sleep medicine: assessing the diagnostic precision of ChatGPT-4.,2025,Journal of clinical sleep medicine : JCSM : official publication of the American Academy of Sleep Medicine,,,,"STUDY OBJECTIVES: Large language models (LLMs) like ChatGPT-4 are emerging in medicine, including sleep medicine, where artificial intelligence (AI) is used to analyze sleep data and predict treatment outcomes. Effectiveness of LLM in accurately diagnosing sleep disorders based on clinical history has not yet been studied. This study evaluates ChatGPT-4's diagnostic performance using clinical vignettes. 
METHODS: Nineteen clinical vignettes containing patient history, physical examination findings, and diagnostic tests from the Case Book of Sleep Medicine (3rd ed., 2019, AASM) were presented to ChatGPT-4. Its differential and final diagnoses were compared to reference diagnoses, with accuracy assessed by (1) the percentage of correct differentials and (2) a three-tier scoring system (no match, partial match, full match) for final diagnoses. RESULTS: The mean accuracy for differential diagnoses was 63.27% +/- 15.61% (SD), ranging from 33.33% to 100%. The mean number of AI-generated differential diagnoses matching the AASM case differential diagnoses was 2.79 +/- 0.71 (SD). For final diagnoses, ChatGPT-4 scored a total of 30 out of a possible 38, resulting in an overall accuracy of 78.95%. The model achieved a mean score of 1.58 +/- 0.61 (SD) out of 2, with 68.42% of cases achieving a full match. Performance was higher in cases with fewer differential diagnoses, whereas accuracy decreased in complex cases. CONCLUSIONS: ChatGPT-4 demonstrates promising diagnostic potential in sleep medicine, with moderate to high accuracy in identifying differential and final diagnoses. However, its variability in more complex cases calls for refinement and clinical validation.",Patel A; Cheung J 38217478,Evaluating insomnia queries from an artificial intelligence chatbot for patient education.,2024,Journal of clinical sleep medicine : JCSM : official publication of the American Academy of Sleep Medicine,,,,"STUDY OBJECTIVES: We evaluated the accuracy of ChatGPT in addressing insomnia-related queries for patient education and assessed ChatGPT's ability to provide varied responses based on differing prompting scenarios. METHODS: Four identical sets of 20 insomnia-related queries were posed to ChatGPT. Each set differed by the context in which ChatGPT was prompted: no prompt, patient-centered, physician-centered, and with references and statistics. 
Responses were reviewed by 2 academic sleep surgeons, 1 academic sleep medicine physician, and 2 sleep medicine fellows across 4 domains: clinical accuracy, prompt adherence, referencing, and statistical precision, using a binary grading system. Flesch-Kincaid grade-level scores were calculated to estimate the grade level of the responses, with statistical differences between prompts analyzed via analysis of variance and Tukey's test. Interrater reliability was calculated using Fleiss's kappa. RESULTS: The study revealed significant variations in the Flesch-Kincaid grade-level scores across 4 prompts: unprompted (13.2 +/- 2.2), patient-centered (8.1 +/- 1.9), physician-centered (15.4 +/- 2.8), and with references and statistics (17.3 +/- 2.3, P < .001). Despite poor Fleiss kappa scores, indicating low interrater reliability for clinical accuracy and relevance, all evaluators agreed that the majority of ChatGPT's responses were clinically accurate, with the highest variability on Form 4. The responses were also uniformly relevant to the given prompts (100% agreement). Eighty percent of the references ChatGPT cited were verified as both real and relevant, and only 25% of cited statistics were corroborated within referenced articles. CONCLUSIONS: ChatGPT can be used to generate clinically accurate responses to insomnia-related inquiries. CITATION: Alapati R, Campbell D, Molin N, et al. Evaluating insomnia queries from an artificial intelligence chatbot for patient education. J Clin Sleep Med. 2024;20(4):583-594.",Alapati R; Campbell D; Molin N; Creighton E; Wei Z; Boon M; Huntley C 38723763,Text summarization with ChatGPT for drug labeling documents.,2024,Drug discovery today,,,,"Text summarization is crucial in scientific research, drug discovery and development, regulatory review, and more. This task demands domain expertise, language proficiency, semantic prowess, and conceptual skill. 
The recent advent of large language models (LLMs), such as ChatGPT, offers unprecedented opportunities to automate this process. We compared ChatGPT-generated summaries with those produced by human experts using FDA drug labeling documents. The labeling contains summaries of key labeling sections, making them an ideal human benchmark to evaluate ChatGPT's summarization capabilities. Analyzing >14000 summaries, we observed that ChatGPT-generated summaries closely resembled those generated by human experts. Importantly, ChatGPT exhibited even greater similarity when summarizing drug safety information. These findings highlight ChatGPT's potential to accelerate work in critical areas, including drug safety.",Ying L; Liu Z; Fang H; Kusko R; Wu L; Harris S; Tong W 39176909,Comparing a Large Language Model with Previous Deep Learning Models on Named Entity Recognition of Adverse Drug Events.,2024,Studies in health technology and informatics,,,,"The ability to fine-tune pre-trained deep learning models to learn how to process a downstream task using a large training set allow to significantly improve performances of named entity recognition. Large language models are recent models based on the Transformers architecture that may be conditioned on a new task with in-context learning, by providing a series of instructions or prompt. These models only require few examples and such approach is defined as few shot learning. Our objective was to compare performances of named entity recognition of adverse drug events between state of the art deep learning models fine-tuned on Pubmed abstracts and a large language model using few-shot learning. Hussain et al's state of the art model (PMID: 34422092) significantly outperformed the ChatGPT-3.5 model (F1-Score: 97.6% vs 86.0%). 
Few-shot learning is a convenient way to perform named entity recognition when training examples are rare, but performance is still inferior to that of a deep learning model fine-tuned with several training examples. Future perspectives include evaluating few-shot prompting with GPT-4 and fine-tuning GPT-3.5.",Tiffet T; Pikaar A; Trombert-Paviot B; Jaulent MC; Bousquet C 38540976,Personalized Medicine Transformed: ChatGPT's Contribution to Continuous Renal Replacement Therapy Alarm Management in Intensive Care Units.,2024,Journal of personalized medicine,,,,"The accurate interpretation of CRRT machine alarms is crucial in the intensive care setting. ChatGPT, with its advanced natural language processing capabilities, has emerged as a tool that is evolving and advancing in its ability to assist with healthcare information. This study is designed to evaluate the accuracy of the ChatGPT-3.5 and ChatGPT-4 models in addressing queries related to CRRT alarm troubleshooting. This study consisted of two rounds of ChatGPT-3.5 and ChatGPT-4 responses to address 50 CRRT machine alarm questions that were carefully selected by two nephrologists in intensive care. Accuracy was determined by comparing the model responses to predetermined answer keys provided by critical care nephrologists, and consistency was determined by comparing outcomes across the two rounds. The accuracy rate of ChatGPT-3.5 was 86% and 84%, while the accuracy rate of ChatGPT-4 was 90% and 94% in the first and second rounds, respectively. The agreement between the first and second rounds of ChatGPT-3.5 was 84% with a Kappa statistic of 0.78, while the agreement of ChatGPT-4 was 92% with a Kappa statistic of 0.88. Although ChatGPT-4 tended to provide more accurate and consistent responses than ChatGPT-3.5, there was no statistically significant difference in accuracy or agreement rate between ChatGPT-3.5 and -4.
ChatGPT-4 showed higher accuracy and consistency, although the difference did not reach statistical significance. While these findings are encouraging, there is still potential for further development to achieve even greater reliability. This advancement is essential for ensuring the highest-quality patient care and safety standards in managing CRRT machine-related issues.",Sheikh MS; Thongprayoon C; Qureshi F; Suppadungsuk S; Kashani KB; Miao J; Craici IM; Cheungpasitporn W 40241839,Large language models in critical care.,2025,Journal of intensive medicine,,,,"The advent of chat generative pre-trained transformer (ChatGPT) and large language models (LLMs) has revolutionized natural language processing (NLP). These models possess unprecedented capabilities in understanding and generating human-like language. This breakthrough holds significant promise for critical care medicine, where unstructured data and complex clinical information are abundant. Key applications of LLMs in this field include administrative support through automated documentation and patient chart summarization; clinical decision support by assisting in diagnostics and treatment planning; personalized communication to enhance patient and family understanding; and improving data quality by extracting insights from unstructured clinical notes. Despite these opportunities, challenges such as the risk of generating inaccurate or biased information (""hallucinations""), ethical considerations, and the need for clinician artificial intelligence (AI) literacy must be addressed. Integrating LLMs with traditional machine learning models - an approach known as Hybrid AI - combines the strengths of both technologies while mitigating their limitations. Careful implementation, regulatory compliance, and ongoing validation are essential to ensure that LLMs enhance patient care rather than hinder it. LLMs have the potential to transform critical care practices, but integrating them requires caution.
Responsible use and thorough clinician training are crucial to fully realize their benefits.",Biesheuvel LA; Workum JD; Reuland M; van Genderen ME; Thoral P; Dongelmans D; Elbers P 37130009,Medical School Admissions: Focusing on Producing a Physician Workforce That Addresses the Needs of the United States.,2023,Academic medicine : journal of the Association of American Medical Colleges,,,,"The aging population, burnout, and earlier retirement of physicians along with the static number of training positions are likely to worsen the current physician shortage. There is an urgent need to transform the process for selecting medical students. In this Invited Commentary, the authors suggest that to build the physician workforce that the United States needs for the future, academic medicine should focus on building capacity in 3 overarching areas. First, medical schools need to develop a more diverse pool of capable applicants that better matches the demographic characteristics of health care trainees with those of the population, and they need to nurture applicants with diverse career aspirations. Second, medical schools should recalibrate their student selection process, aligning criteria for admission with competencies expected of medical school graduates, whether they choose to become practicing clinicians, physician-scientists, members of the public health workforce, or policy makers. Selection criteria that overweight the results of standardized test scores should be replaced by assessments that value and predict academic capacity, adaptive learning skills, curiosity, compassion, empathy, emotional maturity, and superior communication skills. Finally, to improve the equity and effectiveness of the selection processes, medical schools should leverage innovations in data science and generative artificial intelligence platforms. 
The ability of ChatGPT to pass the United States Medical Licensing Examination (USMLE) demonstrates the decreasing importance of memorization in medicine in favor of critical thinking and problem-solving skills. The 2022 change in the USMLE Step 1 to pass/fail plus the exodus of several prominent medical schools from the U.S. News and World Report rankings have exposed limitations of the current selection processes. Newer approaches that use precision education systems to leverage data and technology can help address these limitations.",Prober CG; Desai SV 39050000,How to mitigate the risks of deployment of artificial intelligence in medicine?,2024,Turkish journal of medical sciences,,,,"The aim of this study is to examine the risks associated with the use of artificial intelligence (AI) in medicine and to offer policy suggestions to reduce these risks and optimize the benefits of AI technology. AI is a multifaceted technology. If harnessed effectively, it has the capacity to significantly impact the future of humanity in the field of health, as well as in several other areas. However, the rapid spread of this technology also raises significant ethical, legal, and social issues. This study examines the potential dangers of AI integration in medicine by reviewing current scientific work and exploring strategies to mitigate these risks. Biases in data sets for AI systems can lead to inequities in health care. Educational data that is narrowly represented based on a demographic group can lead to biased results from AI systems for those who do not belong to that group. In addition, the concepts of explainability and accountability in AI systems could create challenges for healthcare professionals in understanding and evaluating AI-generated diagnoses or treatment recommendations. This could jeopardize patient safety and lead to the selection of inappropriate treatments. 
Ensuring the security of personal health information will be critical as AI systems become more widespread. Therefore, improving patient privacy and security protocols for AI systems is imperative. The report offers suggestions for reducing the risks associated with the increasing use of AI systems in the medical sector. These include increasing AI literacy, implementing a participatory society-in-the-loop management strategy, and creating ongoing education and auditing systems. Integrating ethical principles and cultural values into the design of AI systems can help reduce healthcare disparities and improve patient care. Implementing these recommendations will ensure the efficient and equitable use of AI systems in medicine, improve the quality of healthcare services, and ensure patient safety.",Uygun Ilikhan S; Ozer M; Tanberkan H; Bozkurt V 39401518,[AI-supported decision-making in obstetrics - a feasibility study on the medical accuracy and reliability of ChatGPT].,2025,Zeitschrift fur Geburtshilfe und Neonatologie,,,,"The aim of this study is to investigate the feasibility of artificial intelligence in the interpretation and application of medical guidelines to support clinical decision-making in obstetrics. ChatGPT was provided with guidelines on specific obstetric issues. Using several clinical scenarios as examples, the AI was then evaluated for its ability to make accurate diagnoses and appropriate clinical decisions. The results varied, with ChatGPT providing predominantly correct answers in some fictional scenarios but performing inadequately in others. Despite ChatGPT's ability to grasp complex medical information, the study revealed limitations in the precision and reliability of its interpretations and recommendations. These discrepancies highlight the need for careful review by healthcare professionals and underscore the importance of clear, unambiguous guideline recommendations. 
Furthermore, continuous technical development is required to harness artificial intelligence as a supportive tool in clinical practice. Overall, while the use of AI in medicine shows promise, its current suitability primarily lies in controlled scientific settings, given its potential error susceptibility and interpretation weaknesses, in order to safeguard the safety and accuracy of patient care.",Bader S; Schneider MO; Psilopatis I; Anetsberger D; Emons J; Kehl S 38886470,Multi role ChatGPT framework for transforming medical data analysis.,2024,Scientific reports,,,,"The application of ChatGPT in the medical field has sparked debate regarding its accuracy. To address this issue, we present a Multi-Role ChatGPT Framework (MRCF), designed to improve ChatGPT's performance in medical data analysis by optimizing prompt words, integrating real-world data, and implementing quality control protocols. Compared to the singular ChatGPT model, MRCF significantly outperforms traditional manual analysis in interpreting medical data, exhibiting fewer random errors, higher accuracy, and better identification of incorrect information. Notably, MRCF is over 600 times more time-efficient than conventional manual annotation methods and costs only one-tenth as much. Leveraging MRCF, we have established two user-friendly databases for efficient and straightforward drug repositioning analysis. This research not only enhances the accuracy and efficiency of ChatGPT in medical data science applications but also offers valuable insights for data analysis models across various professional domains.",Chen H; Zhang S; Zhang L; Geng J; Lu J; Hou C; He P; Lu X 37956228,ChatGPT in Drug Discovery: A Case Study on Anticocaine Addiction Drug Development with Chatbots.,2023,Journal of chemical information and modeling,,,,"The birth of ChatGPT, a cutting-edge language model-based chatbot developed by OpenAI, ushered in a new era in AI.
However, due to potential pitfalls, its role in rigorous scientific research is not clear yet. This paper vividly showcases its innovative application within the field of drug discovery. Focused specifically on developing anticocaine addiction drugs, the study employs GPT-4 as a virtual guide, offering strategic and methodological insights to researchers working on generative models for drug candidates. The primary objective is to generate optimal drug-like molecules with desired properties. By leveraging the capabilities of ChatGPT, the study introduces a novel approach to the drug discovery process. This symbiotic partnership between AI and researchers transforms how drug development is approached. Chatbots become facilitators, steering researchers toward innovative methodologies and productive paths for creating effective drug candidates. This research sheds light on the collaborative synergy between human expertise and AI assistance, wherein ChatGPT's cognitive abilities enhance the design and development of pharmaceutical solutions. This paper not only explores the integration of advanced AI in drug discovery but also reimagines the landscape by advocating for AI-powered chatbots as trailblazers in revolutionizing therapeutic innovation.",Wang R; Feng H; Wei GW 37645039,ChatGPT in Drug Discovery: A Case Study on Anti-Cocaine Addiction Drug Development with Chatbots.,2023,ArXiv,,,,"The birth of ChatGPT, a cutting-edge language model-based chatbot developed by OpenAI, ushered in a new era in AI. However, due to potential pitfalls, its role in rigorous scientific research is not clear yet. This paper vividly showcases its innovative application within the field of drug discovery. Focused specifically on developing anti-cocaine addiction drugs, the study employs GPT-4 as a virtual guide, offering strategic and methodological insights to researchers working on generative models for drug candidates. 
The primary objective is to generate optimal drug-like molecules with desired properties. By leveraging the capabilities of ChatGPT, the study introduces a novel approach to the drug discovery process. This symbiotic partnership between AI and researchers transforms how drug development is approached. Chatbots become facilitators, steering researchers towards innovative methodologies and productive paths for creating effective drug candidates. This research sheds light on the collaborative synergy between human expertise and AI assistance, wherein ChatGPT's cognitive abilities enhance the design and development of potential pharmaceutical solutions. This paper not only explores the integration of advanced AI in drug discovery but also reimagines the landscape by advocating for AI-powered chatbots as trailblazers in revolutionizing therapeutic innovation.",Wang R; Feng H; Wei GW 38472596,Artificial intelligence in academic writing and clinical pharmacy education: consequences and opportunities.,2024,International journal of clinical pharmacy,,,,"The current academic debate on the use of artificial intelligence (AI) in research and teaching has been ongoing since the launch of ChatGPT in November 2022. It mainly focuses on ethical considerations, academic integrity, authorship and the need for new legal frameworks. Time efficiencies may allow for more critical thinking, while ease of pattern recognition across large amounts of data may promote drug discovery, better clinical decision making and guideline development with resultant consequences for patient safety. AI is also prompting a re-evaluation of the nature of learning and the purpose of education worldwide. It challenges traditional pedagogies, forcing a shift from rote learning to more critical, analytical, and creative thinking skills. Despite this opportunity to re-think education concepts for pharmacy curricula several universities around the world have banned its use. 
This commentary summarizes the existing debate and identifies the consequences and opportunities for clinical pharmacy research and education.",Weidmann AE 38811359,Attention mechanism models for precision medicine.,2024,Briefings in bioinformatics,,,,"The development of deep learning models plays a crucial role in advancing precision medicine. These models enable personalized medical treatments and interventions based on the unique genetic, environmental and lifestyle factors of individual patients, and the promotion of precision medicine is achieved mainly through genomic data analysis, variant annotation and interpretation, pharmacogenomics research, biomarker discovery, disease typing, clinical decision support and disease mechanism interpretation. Extensive research has been conducted to address precision medicine challenges using attention mechanism models such as SAN, GAT and transformers. Especially, the recent popularity of ChatGPT has significantly propelled the application of this model type to a new height. Therefore, I propose a Special Issue for Briefings in Bioinformatics about the topic 'Attention Mechanism Models for Precision Medicine'. This Special Issue aims to provide a comprehensive overview and presentation of innovative researches on the application of graph attention mechanism models in precision medicine.",Cheng L 37731643,Applications of large language models in cancer care: current evidence and future perspectives.,2023,Frontiers in oncology,,,,"The development of large language models (LLMs) is a recent success in the field of generative artificial intelligence (AI). They are computer models able to perform a wide range of natural language processing tasks, including content generation, question answering, or language translation. In recent months, a growing number of studies aimed to assess their potential applications in the field of medicine, including cancer care. 
In this mini review, we describe the currently published evidence for using LLMs in oncology. All the available studies assessed ChatGPT, an advanced language model developed by OpenAI, alone or compared to other LLMs, such as Google Bard, Chatsonic, and Perplexity. Although ChatGPT could provide adequate information on the screening or the management of specific solid tumors, it also demonstrated a significant error rate and a tendency toward providing obsolete data. Therefore, an accurate, expert-driven verification process remains mandatory to avoid the potential for misinformation and incorrect evidence. Overall, although this new generative AI-based technology has the potential to revolutionize the field of medicine, including that of cancer care, it will be necessary to develop rules to guide the application of these tools to maximize benefits and minimize risks.",Iannantuono GM; Bracken-Clarke D; Floudas CS; Roselli M; Gulley JL; Karzai F 36834073,Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study.,2023,International journal of environmental research and public health,,,,"The diagnostic accuracy of differential diagnoses generated by artificial intelligence (AI) chatbots, including the generative pretrained transformer 3 (GPT-3) chatbot (ChatGPT-3) is unknown. This study evaluated the accuracy of differential-diagnosis lists generated by ChatGPT-3 for clinical vignettes with common chief complaints. General internal medicine physicians created clinical cases, correct diagnoses, and five differential diagnoses for ten common chief complaints. The rate of correct diagnosis by ChatGPT-3 within the ten differential-diagnosis lists was 28/30 (93.3%). The rate of correct diagnosis by physicians was still superior to that by ChatGPT-3 within the five differential-diagnosis lists (98.3% vs. 83.3%, p = 0.03).
The rate of correct diagnosis by physicians was also superior to that by ChatGPT-3 in the top diagnosis (93.3% vs. 53.3%, p < 0.001). The rate of consistent differential diagnoses among physicians within the ten differential-diagnosis lists generated by ChatGPT-3 was 62/88 (70.5%). In summary, this study demonstrates the high diagnostic accuracy of differential-diagnosis lists generated by ChatGPT-3 for clinical cases with common chief complaints. This suggests that AI chatbots such as ChatGPT-3 can generate a well-differentiated diagnosis list for common chief complaints. However, the order of these lists can be improved in the future.",Hirosawa T; Harada Y; Yokose M; Sakamoto T; Kawamura R; Shimizu T 38413108,Updated Primer on Generative Artificial Intelligence and Large Language Models in Medical Imaging for Medical Professionals.,2024,Korean journal of radiology,,,,"The emergence of Chat Generative Pre-trained Transformer (ChatGPT), a chatbot developed by OpenAI, has garnered interest in the application of generative artificial intelligence (AI) models in the medical field. This review summarizes different generative AI models and their potential applications in the field of medicine and explores the evolving landscape of Generative Adversarial Networks and diffusion models since the introduction of generative AI models. These models have made valuable contributions to the field of radiology. Furthermore, this review also explores the significance of synthetic data in addressing privacy concerns and augmenting data diversity and quality within the medical domain, in addition to emphasizing the role of inversion in the investigation of generative models and outlining an approach to replicate this process.
We provide an overview of large language models, such as generative pre-trained transformers (GPTs) and bidirectional encoder representations from transformers (BERTs), focusing on prominent representatives, and discuss recent initiatives involving language-vision models in radiology, including the innovative large language and vision assistant for biomedicine (LLaVa-Med), to illustrate their practical application. This comprehensive review offers insights into the wide-ranging applications of generative AI models in clinical research and emphasizes their transformative potential.",Kim K; Cho K; Jang R; Kyung S; Lee S; Ham S; Choi E; Hong GS; Kim N 37852647,GPT-4 in Nuclear Medicine Education: Does It Outperform GPT-3.5?,2023,Journal of nuclear medicine technology,,,,"The emergence of ChatGPT has challenged academic integrity in teaching institutions, including those providing nuclear medicine training. Although previous evaluations of ChatGPT have suggested a limited scope for academic writing, the March 2023 release of generative pretrained transformer (GPT)-4 promises enhanced capabilities that require evaluation. Methods: Examinations (final and calculation) and written assignments for nuclear medicine subjects were tested using GPT-3.5 and GPT-4. GPT-3.5 and GPT-4 responses were evaluated by Turnitin software for artificial intelligence scores, marked against standardized rubrics, and compared with the mean performance of student cohorts. Results: ChatGPT powered by GPT-3.5 performed poorly in calculation examinations (31.4%), compared with GPT-4 (59.1%). GPT-3.5 failed each of 3 written tasks (39.9%), whereas GPT-4 passed each task (56.3%).
Conclusion: Although GPT-3.5 poses a minimal risk to academic integrity, GPT-4 significantly enhances its usefulness as a cheating tool, although it remains prone to hallucination and fabrication.",Currie GM 38295300,Medical malpractice liability in large language model artificial intelligence: legal review and policy recommendations.,2024,Journal of osteopathic medicine,,,,"The emergence of generative large language model (LLM) artificial intelligence (AI) represents one of the most profound developments in healthcare in decades, with the potential to create revolutionary and seismic changes in the practice of medicine as we know it. However, significant concerns have arisen over questions of liability for bad outcomes associated with LLM AI-influenced medical decision making. Although the authors were not able to identify a case in the United States that has been adjudicated on medical malpractice in the context of LLM AI at this time, sufficient precedent exists to interpret how analogous situations might be applied to these cases when they inevitably come to trial in the future. This commentary will discuss areas of potential legal vulnerability for clinicians utilizing LLM AI through review of past case law pertaining to third-party medical guidance and review the patchwork of current regulations relating to medical malpractice liability in AI.
Finally, we will propose proactive policy recommendations, including creating an enforcement duty at the US Food and Drug Administration (FDA) to require algorithmic transparency, recommending reliance on peer-reviewed data and rigorous validation testing when LLMs are utilized in clinical settings, and encouraging tort reform to share liability between physicians and LLM developers.",Shumway DO; Hartman HJ 40043742,"Solving Complex Pediatric Surgical Case Studies: A Comparative Analysis of Copilot, ChatGPT-4, and Experienced Pediatric Surgeons' Performance.",2025,European journal of pediatric surgery : official journal of Austrian Association of Pediatric Surgery ... [et al] = Zeitschrift fur Kinderchirurgie,,,,"The emergence of large language models (LLMs) has led to notable advancements across multiple sectors, including medicine. Yet, their effect in pediatric surgery remains largely unexplored. This study aims to assess the ability of the artificial intelligence (AI) models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary and differential diagnoses, as well as answer clinical questions using complex clinical case vignettes of classic pediatric surgical diseases. We conducted the study in April 2024. We evaluated the performance of LLMs using 13 complex clinical case vignettes of pediatric surgical diseases and compared responses to a human cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the diagnostic recommendations of LLMs for completeness and accuracy. To determine differences in performance, we performed statistical analyses. ChatGPT-4 achieved a higher test score (52.1%) compared to Copilot (47.9%) but lower than pediatric surgeons (68.8%). Overall differences in performance between ChatGPT-4, Copilot, and pediatric surgeons were found to be statistically significant (p < 0.01). ChatGPT-4 demonstrated superior performance in generating differential diagnoses compared to Copilot (p < 0.05).
No statistically significant differences were found between the AI models regarding suggestions for diagnostics and primary diagnosis. Overall, the recommendations of LLMs were rated as average by pediatric surgeons. This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs exhibit potential across various areas, their reliability and accuracy in handling clinical decision-making tasks are limited. Further research is needed to improve AI capabilities and establish its usefulness in the clinical setting.",Gnatzy R; Lacher M; Berger M; Boettcher M; Deffaa OJ; Kubler J; Madadi-Sanjani O; Martynov I; Mayer S; Pakarinen MP; Wagner R; Wester T; Zani A; Aubert O 39417635,GPT-4-based AI agents-the new expert system for detection of antimicrobial resistance mechanisms?,2024,Journal of clinical microbiology,,,,"The European Committee on Antimicrobial Susceptibility Testing (EUCAST) recommends two steps for detecting beta-lactamases in Gram-negative bacteria. Screening for potential extended-spectrum beta-lactamase (ESBL), plasmid-mediated AmpC beta-lactamase, or carbapenemase production is followed by confirmatory testing. We aimed to validate generative pre-trained transformer (GPT)-4 and GPT-agent for pre-classification of disk diffusion to indicate potential beta-lactamases. We assigned 225 Gram-negative isolates based on phenotypic resistances against beta-lactam antibiotics and additional tests to one or more resistance mechanisms as follows: ""none,"" ""ESBL,"" ""AmpC,"" or ""carbapenemase."" Next, we customized a GPT-agent with EUCAST guidelines and breakpoint table (v13.1). We compared routine diagnostics (reference) to those of (i) EUCAST-GPT-expert, (ii) microbiologists, and (iii) non-customized GPT-4. We determined sensitivities and specificities to flag suspect resistances.
Three microbiologists showed concordance in 814/862 (94.4%) phenotypic categories and used a median of eight words (interquartile range [IQR] 4-11) for reasoning. Median sensitivity/specificity for ESBL, AmpC, and carbapenemase were 98%/99.1%, 96.8%/97.1%, and 95.5%/98.5%, respectively. Three prompts of EUCAST-GPT-expert showed concordance in 706/862 (81.9%) categories but used a median of 158 words (IQR 140-174) for reasoning. Sensitivity/specificity for ESBL, AmpC, and carbapenemase prediction were 95.4%/69.23%, 96.9%/86.3%, and 100%/98.8%, respectively. Non-customized GPT-4 could interpret 169/862 (19.6%) categories, and 137/169 (81.1%) agreed with routine diagnostics. Non-customized GPT-4 used a median of 85 words (IQR 72-105) for reasoning. Microbiologists showed higher concordance and shorter argumentations compared to GPT-agents. Humans showed higher specificities compared to GPT-agents. GPT-agents' unspecific flagging of ESBL and AmpC potentially results in additional testing, diagnostic delays, and higher costs. GPT-4 is not approved by regulatory bodies, and validation of large language models is needed. IMPORTANCE: The study titled ""GPT-4-based AI agents-the new expert system for detection of antimicrobial resistance mechanisms?"" is critically important as it explores the integration of advanced artificial intelligence (AI) technologies, like generative pre-trained transformer (GPT)-4, into the field of laboratory medicine, specifically in the diagnostics of antimicrobial resistance (AMR). With the growing challenge of AMR, there is a pressing need for innovative solutions that can enhance diagnostic accuracy and efficiency. This research assesses the capability of AI to support the existing two-step confirmatory process recommended by the European Committee on Antimicrobial Susceptibility Testing for detecting beta-lactamases in Gram-negative bacteria.
By potentially speeding up and improving the precision of initial screenings, AI could reduce the time to appropriate treatment interventions. Furthermore, this study is vital for validating the reliability and safety of AI tools in clinical settings, ensuring they meet stringent regulatory standards before they can be broadly implemented. This could herald a significant shift in how laboratory diagnostics are performed, ultimately leading to better patient outcomes.",Giske CG; Bressan M; Fiechter F; Hinic V; Mancini S; Nolte O; Egli A 37211242,ChatGPT and autoimmunity - A new weapon in the battlefield of knowledge.,2023,Autoimmunity reviews,,,,"The field of medical research has been always full of innovation and huge leaps revolutionizing the scientific world. In the recent years, we have witnessed this firsthand by the evolution of Artificial Intelligence (AI), with ChatGPT being the most recent example. ChatGPT is a language chat bot which generates human-like texts based on data from the internet. If viewed from a medical point view, ChatGPT has shown capabilities of composing medical texts similar to those depicted by experienced authors, to solve clinical cases, to provide medical solutions, among other fascinating performances. Nevertheless, the value of the results, limitations, and clinical implications still need to be carefully evaluated. In our current paper on the role of ChatGPT in clinical medicine, particularly in the field of autoimmunity, we aimed to illustrate the implication of this technology alongside the latest utilization and limitations. In addition, we included an expert opinion on the cyber-related aspects of the bot potentially contributing to the risks attributed to its use, alongside proposed defense mechanisms. 
All of this must be considered in light of the rapid, continuous improvement that AI undergoes on a daily basis.",Darkhabani M; Alrifaai MA; Elsalti A; Dvir YM; Mahroum N 37162073,FUTURE OF THE LANGUAGE MODELS IN HEALTHCARE: THE ROLE OF CHATGPT.,2023,Arquivos brasileiros de cirurgia digestiva : ABCD = Brazilian archives of digestive surgery,,,,"The field of medicine has always been at the forefront of technological innovation, constantly seeking new strategies to diagnose, treat, and prevent diseases. Guidelines for clinical practice to orient medical teams regarding diagnosis, treatment, and prevention measures have increased over the years. The purpose is to gather the best available medical knowledge to construct an orientation for practice. Evidence-based guidelines follow several main characteristics of a systematic review, including systematic and unbiased search, selection, and extraction of the sources of evidence. In recent years, the rapid advancement of artificial intelligence has provided clinicians and patients with access to personalized, data-driven insights and support, and new opportunities for healthcare professionals to improve patient outcomes, increase efficiency, and reduce costs. One of the most exciting developments in Artificial Intelligence has been the emergence of chatbots. A chatbot is a computer program used to simulate conversations with human users. Recently, OpenAI, a research organization focused on machine learning, developed ChatGPT, a large language model that generates human-like text. ChatGPT uses a type of AI known as a deep learning model. ChatGPT can quickly search and select pieces of evidence through numerous databases to provide answers to complex questions, reducing the time and effort required to research a particular topic manually. Consequently, language models can accelerate the creation of clinical practice guidelines.
While there is no doubt that ChatGPT has the potential to revolutionize the way healthcare is delivered, it is essential to note that it should not be used as a substitute for human healthcare professionals. Instead, ChatGPT should be considered a tool that can be used to augment and support the work of healthcare professionals, helping them to provide better care to their patients.",Tustumi F; Andreollo NA; Aguilar-Nascimento JE 37223340,ChatGPT-4 and the Global Burden of Disease Study: Advancing Personalized Healthcare Through Artificial Intelligence in Clinical and Translational Medicine.,2023,Cureus,,,,"The fusion of insights from the comprehensive global burden of disease (GBD) study and the advanced artificial intelligence of open artificial intelligence (AI) chat generative pre-trained transformer version 4 (ChatGPT-4) brings the potential to transform personalized healthcare planning. By integrating the data-driven findings of the GBD study with the powerful conversational capabilities of ChatGPT-4, healthcare professionals can devise customized healthcare plans that are adapted to patients' lifestyles and preferences. We propose that this innovative partnership can lead to the creation of a novel AI-assisted personalized disease burden (AI-PDB) assessment and planning tool. For the successful implementation of this unconventional technology, it is crucial to ensure continuous and accurate updates, expert supervision, and address potential biases and limitations. Healthcare professionals and stakeholders should have a balanced and dynamic approach, emphasizing interdisciplinary collaborations, data accuracy, transparency, ethical compliance, and ongoing training. By investing in the unique strengths of both ChatGPT-4, especially its newly introduced features such as live internet browsing or plugins, and the GBD study, we may enhance personalized healthcare planning. 
This innovative approach has the potential to improve patient outcomes and optimize resource utilization, as well as pave the way for the worldwide implementation of precision medicine, thereby revolutionizing the existing healthcare landscape. However, to fully harness these benefits at both the global and individual levels, further research and development are warranted. This will ensure that we effectively tap into the potential of this synergy, bringing societies closer to a future where personalized healthcare is the norm rather than the exception.",Temsah MH; Jamal A; Aljamaan F; Al-Tawfiq JA; Al-Eyadhy A 38482764,"[Applications, techniques, and best practices for using ChatGPT].",2024,Revue medicale suisse,,,,"The future of a machine writing our reports for us could also lead to it carrying out our consultations, a scenario whose relevance is open to debate. Nevertheless, the present offers us new artificial intelligence tools that can support us in our daily activities. The publication in 2017 of Transformers initiated a disruptive revolution by enabling the emergence of major language models, of which ChatGPT is the best known. In view of their growing adoption, the authors felt it would be useful to offer some pragmatic advice on how to improve the use of these tools. In this article, we first look at how ChatGPT works and its potential applications in medicine, before providing a practical guide to using it to get the best results.",Garin D; Lovis C 38935938,ChatGPT and Medicine: Together We Embrace the AI Renaissance.,2024,JMIR bioinformatics and biotechnology,,,,"The generative artificial intelligence (AI) model ChatGPT holds transformative prospects in medicine. The development of such models has signaled the beginning of a new era where complex biological data can be made more accessible and interpretable. ChatGPT is a natural language processing tool that can process, interpret, and summarize vast data sets. 
It can serve as a digital assistant for physicians and researchers, aiding in integrating medical imaging data with other multiomics data and facilitating the understanding of complex biological systems. The physician's and AI's viewpoints emphasize the value of such AI models in medicine, providing tangible examples of how this could enhance patient care. The editorial also discusses the rise of generative AI, highlighting its substantial impact in democratizing AI applications for modern medicine. While AI may not supersede health care professionals, practitioners incorporating AI into their practices could potentially have a competitive edge.",Hacking S 39017630,Solving Advanced Task-Specific Problems in Measurement Sciences with Generative AI.,2024,Analytical chemistry,,,,"The Generative Pre-Trained Transformer known as ChatGPT-4 has undergone extensive pretraining on a diverse data set, enabling it to generate coherent and contextually relevant text based on the input it receives. This capability allows it to perform tasks from answering questions and has attracted significant interest in material sciences, synthetic chemistry, and drug discovery. In this work, we posed four advanced task-specific problems to ChatGPT, which were recently published in leading journals for topics in analytical chemistry, spectroscopy, bioimage super-resolution, and electrochemistry. ChatGPT-4 successfully implemented the four ideas after assigning the ""persona"" to the AI and posing targeted questions. We show two cases where ""unguided"" ChatGPT could complete the assignments with minimal human direction. The construction of a microwave spectrum from a free induction curve and super-resolution of bioimages was accomplished using this approach. Two other specific tasks, correcting a complex baseline with morphological operations of set theory and estimating the diffusion current, required additional input, e.g., equations and specific directions from the user. 
In each case, the MATLAB code was eventually generated by ChatGPT-4 even when the original authors did not provide any code themselves. We show that a validation test must be implemented to detect and correct mistakes made by ChatGPT-4, followed by feedback correction. These approaches will pave the way for open and transparent science and eliminate the black boxes in measurement science when it comes to advanced data processing.",Wahab MF; Handlovic TT; Roy S; Burk RJ; Armstrong DW 37699647,ChatGPT and Patient Information in Nuclear Medicine: GPT-3.5 Versus GPT-4.,2023,Journal of nuclear medicine technology,,,,"The GPT-3.5-powered ChatGPT was released in late November 2022 powered by the generative pretrained transformer (GPT) version 3.5. It has emerged as a readily accessible source of patient information ahead of medical procedures. Although ChatGPT has purported benefits for supporting patient education and information, actual capability has not been evaluated. Moreover, the March 2023 emergence of paid subscription access to GPT-4 promises further enhanced capabilities requiring evaluation. Methods: ChatGPT was used to generate patient information sheets suitable for gaining informed consent for 7 common procedures in nuclear medicine. Responses were generated independently for both GPT-3.5 and GPT-4 architectures. Specific procedures were selected that had a long-standing history of use to avoid any bias associated with the September 2021 learning cutoff that constrains both GPT-3.5 and GPT-4 architectures. Each information sheet was independently evaluated by 3 expert assessors and ranked on the basis of accuracy, appropriateness, currency, and fitness for purpose. Results: ChatGPT powered by GPT-3.5 provided patient information that was appropriate in terms of being patient-facing but lacked accuracy and currency and omitted important information. GPT-3.5 produced patient information deemed not fit for the purpose. 
GPT-4 provided patient information enhanced across appropriateness, accuracy, and currency, despite some omission of information. GPT-4 produced patient information that was largely fit for the purpose. Conclusion: Although ChatGPT powered by GPT-3.5 is accessible and provides plausible patient information, inaccuracies and omissions present a risk to patients and informed consent. Conversely, GPT-4 is more accurate and fit for the purpose but, at the time of writing, was available only through a paid subscription.",Currie G; Robbie S; Tually P 39842505,Is ChatGPT Ready for Public Use in Organ-Specific Drug Toxicity Research?,2025,Drug discovery today,,,,"The growing impact of large language models (LLMs), such as ChatGPT, prompts questions about the reliability of their application in public health. We compared drug toxicity assessments by GPT-4 for liver, heart, and kidney against expert assessments using US Food and Drug Administration (FDA) drug-labeling documents. Two approaches were assessed: a 'General prompt', mimicking the conversational style used by the general public, and an 'Expert prompt' engineered to represent an approach of an expert. The Expert prompt achieved higher accuracy (64-75%) compared with the General prompt (48-72%), but the overall performance was moderate, indicating that caution is needed when using GPT-4 for public health. To improve reliability, an advanced framework, such as Retrieval Augmented Generation (RAG), might be required to leverage knowledge embedded in GPT-4.",Connor S; Wu L; Roberts RA; Tong W 38153778,Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care.,2023,JMIR medical education,,,,"The growing presence of large language models (LLMs) in health care applications holds significant promise for innovative advancements in patient care. However, concerns about ethical implications and potential biases have been raised by various stakeholders. 
Here, we evaluate the ethics of LLMs in medicine along 2 key axes: empathy and equity. We outline the importance of these factors in novel models of care and develop frameworks for addressing these alongside LLM deployment.",Koranteng E; Rao A; Flores E; Lev M; Landman A; Dreyer K; Succi M 40380650,Bridging AI and Medical Expertise: ChatGPT's Success on the Medical Specialization Residency Admission Exam in Spain.,2025,Studies in health technology and informatics,,,,"The growing use of Artificial Intelligence (AI) in healthcare, particularly focusing on the potential of generative AI models like ChatGPT-4 is a trending topic. The study examines how ChatGPT-4 performed on the national Medicine Residency exam in Spain, a highly selective test for accessing the medical specialization training program called MIR. ChatGPT-4 answered 210 questions, including 25 that required image interpretation. The chatbot correctly answered 150 out of 200 questions, achieving an estimated ranking of around 1900-2300 out of 11,577 candidates. This performance would allow access to most medical specialties in Spain. No significant differences were found between questions requiring image analysis and those that did not, but ChatGPT struggled with more difficult questions, showing a higher error rate for complex problems just like a human being. Despite its potential as an educational and problem-solving tool, the study highlights ChatGPT's limitations, including occasional ""AI hallucinations"" (incorrect or nonsensical answers) and variability in responses when questions were repeated. 
The study emphasizes that while AI tools such as ChatGPT can assist in education and medical tasks, they cannot replace qualified healthcare professionals, and their output requires careful verification.",Leis A; Mayer MA; Mayer A 39982968,Large Language Models as Tools for Molecular Toxicity Prediction: AI Insights into Cardiotoxicity.,2025,Journal of chemical information and modeling,,,,"The importance of drug toxicity assessment lies in ensuring the safety and efficacy of the pharmaceutical compounds. Predicting toxicity is crucial in drug development and risk assessment. This study compares the performance of GPT-4 and GPT-4o with traditional deep-learning and machine-learning models, WeaveGNN, MorganFP-MLP, SVC, and KNN, in predicting molecular toxicity, focusing on bone, neuro, and reproductive toxicity. The results indicate that GPT-4 is comparable to deep-learning and machine-learning models in certain areas. We utilized GPT-4 combined with molecular docking techniques to study the cardiotoxicity of three specific targets, examining traditional Chinese medicinal materials listed as both food and medicine. This approach aimed to explore the potential cardiotoxicity and mechanisms of action. The study found that components in Black Sesame, Ginger, Perilla, Sichuan Pagoda Tree Fruit, Galangal, Turmeric, Licorice, Chinese Yam, Amla, and Nutmeg exhibit toxic effects on cardiac target Cav1.2. 
The docking results indicated significant binding affinities, supporting the hypothesis of potential cardiotoxic effects. This research highlights the potential of ChatGPT in predicting molecular properties and its significance in medicinal chemistry, demonstrating its facilitation of a new research paradigm: with a data set, high-accuracy learning models can be generated without requiring computational knowledge or coding skills, making it accessible and easy to use.",Yang H; Xiu J; Yan W; Liu K; Cui H; Wang Z; He Q; Gao Y; Han W 37546190,Biomedical Ethical Aspects Towards the Implementation of Artificial Intelligence in Medical Education.,2023,Medical science educator,,,,"The increasing use of artificial intelligence (AI) in medicine is associated with new ethical challenges and responsibilities. However, special considerations and concerns should be addressed when integrating AI applications into medical education, where healthcare, AI, and education ethics collide. This commentary explores the biomedical ethical responsibilities of medical institutions in incorporating AI applications into medical education by identifying potential concerns and limitations, with the goal of implementing applicable recommendations. The recommendations presented are intended to assist in developing institutional guidelines for the ethical use of AI for medical educators and students.",Busch F; Adams LC; Bressem KK 37873042,Healthcare's New Horizon With ChatGPT's Voice and Vision Capabilities: A Leap Beyond Text.,2023,Cureus,,,,"The integration of artificial intelligence (AI) in healthcare is responsible for a paradigm shift in medicine. OpenAI's recent augmentation of their Generative Pre-trained Transformer (ChatGPT) large language model (LLM) with voice and image recognition capabilities (OpenAI, Delaware) presents another potential transformative tool for healthcare. 
Envision a healthcare setting where professionals engage in dynamic interactions with ChatGPT to navigate the complexities of atypical medical scenarios. In this innovative landscape, practitioners could solicit ChatGPT's expertise for concise summarizations and insightful extrapolations from a myriad of web-based resources pertaining to similar medical conditions. Furthermore, imagine patients using ChatGPT to identify abnormalities in medical images or skin lesions. While the prospects are diverse, challenges such as suboptimal audio quality and ensuring data security necessitate cautious integration in medical practice. Drawing insights from previous ChatGPT iterations could provide a prudent roadmap for navigating possible challenges. This editorial explores some possible horizons and potential hurdles of ChatGPT's enhanced functionalities in healthcare, emphasizing the importance of continued refinements and vigilance to maximize the benefits while minimizing risks. Through collaborative efforts between AI developers and healthcare professionals, another fusion of AI and healthcare can evolve into enriched patient care and enhanced medical experience.",Temsah R; Altamimi I; Alhasan K; Temsah MH; Jamal A 38678576,AI chatbots in pet health care: Opportunities and challenges for owners.,2024,Veterinary medicine and science,,,,"The integration of artificial intelligence (AI) into health care has seen remarkable advancements, with applications extending to animal health. This article explores the potential benefits and challenges associated with employing AI chatbots as tools for pet health care. Focusing on ChatGPT, a prominent language model, the authors elucidate its capabilities and its potential impact on pet owners' decision-making processes. AI chatbots offer pet owners access to extensive information on animal health, research studies and diagnostic options, providing a cost-effective and convenient alternative to traditional veterinary consultations. 
The fate of a case involving a Border Collie named Sassy demonstrates the potential benefits of AI in veterinary medicine. In this instance, ChatGPT played a pivotal role in suggesting a diagnosis that led to successful treatment, showcasing the potential of AI chatbots as valuable tools in complex cases. However, concerns arise regarding pet owners relying solely on AI chatbots for medical advice, potentially resulting in misdiagnosis, inappropriate treatment and delayed professional intervention. We emphasize the need for a balanced approach, positioning AI chatbots as supplementary tools rather than substitutes for licensed veterinarians. To mitigate risks, the article proposes strategies such as educating pet owners on AI chatbots' limitations, implementing regulations to guide AI chatbot companies and fostering collaboration between AI chatbots and veterinarians. The intricate web of responsibilities in this dynamic landscape underscores the importance of government regulations, the educational role of AI chatbots and the symbiotic relationship between AI technology and veterinary expertise. In conclusion, while AI chatbots hold immense promise in transforming pet health care, cautious and informed usage is crucial. By promoting awareness, establishing regulations and fostering collaboration, the article advocates for a responsible integration of AI chatbots to ensure optimal care for pets.",Jokar M; Abdous A; Rahmanian V 39677224,Human-Computer Interaction: A Literature Review of Artificial Intelligence and Communication in Healthcare.,2024,Cureus,,,,"The integration of artificial intelligence (AI) into healthcare communication has rapidly evolved, driven by advancements in large language models (LLMs) such as Chat Generative Pre-trained Transformer (ChatGPT). 
This literature review explores AI's role in patient-physician interactions, particularly focusing on its capacity to enhance communication by bridging language barriers, summarizing complex medical data, and offering empathetic responses. AI's strengths lie in its ability to deliver comprehensible, concise, and medically accurate information. Studies indicate AI can outperform human physicians in certain communicative aspects, such as empathy and clarity, with models like ChatGPT and the Medical Pathways Language Model (Med-PaLM) demonstrating high effectiveness in these areas. However, significant challenges remain, including occasional inaccuracies and ""hallucinations,"" where AI-generated content is irrelevant or medically inaccurate. These limitations highlight the need for continued refinement in AI algorithms to ensure reliability and consistency in sensitive healthcare settings. The review underscores the potential of AI as a transformative tool in health communication while advocating for further research and policy development to mitigate risks and enhance AI's integration into clinical practice.",Clay TJ; Da Custodia Steel ZJ; Jacobs C 39897335,OpenEvidence: Enhancing Medical Student Clinical Rotations With AI but With Limitations.,2025,Cureus,,,,"The integration of artificial intelligence (AI) into healthcare has introduced tools that improve medical education and clinical practice. OpenEvidence is an example, providing real-time synthesis and access to medical literature, particularly for medical students during clinical rotations. By enabling efficient searches for clinical guidelines, diagnostic criteria, and therapeutic approaches, it streamlines decision-making and study preparation. Its ability to present recent publications and highlight less commonly discussed treatments supports evidence-based learning. Despite these strengths, OpenEvidence has limitations. 
It struggles with targeted searches for specific articles, authors, or journals and operates through an opaque curation process. Compared to ChatGPT, which offers conversational interactivity, and UpToDate, known for its comprehensive, CME-accredited content, OpenEvidence lacks certain advanced features. However, its user-friendly design and focus on clinical evidence make it a valuable, accessible alternative. This editorial critically examines OpenEvidence's capabilities and limitations, comparing it with established tools. It emphasizes the need for greater transparency, broader evidence integration, and enhanced functionality to maximize its impact. Addressing these challenges could improve OpenEvidence's utility, supporting a more effective, evidence-based approach to medical education and clinical practice.",Patel N; Grewal H; Buddhavarapu V; Dhillon G 40052277,Leveraging ChatGPT in cardiogeriatrics.,2025,Minerva cardiology and angiology,,,,"The integration of artificial intelligence (AI) into healthcare is transforming medical practice, and this holds true also for the prevention, diagnosis and treatment of cardiovascular disease in older patients. Large language models (LLMs) like ChatGPT (OpenAI, San Francisco, CA, USA) represent cutting edge AI tools which may offer significant potential to enhance patient care by improving communication, aiding in diagnosis, and assisting in treatment planning. In elderly patients, who often present with complex health profiles and multiple comorbidities, AI can prove particularly beneficial, and it can analyze extensive data to provide personalized, evidence-based recommendations. For instance, ChatGPT can support clinicians in managing polypharmacy by identifying potential drug interactions and suggesting optimal medication regimens, thereby reducing adverse effects. 
Additionally, AI tools can help overcome therapeutic inertia by prompting timely treatment adjustments, ensuring that elderly patients receive appropriate interventions. However, the successful implementation of AI in cardiogeriatrics requires robust technological infrastructures, a synergistic integration with electronic health records, and careful consideration of ethical and privacy concerns. Ongoing collaboration between technologists and healthcare professionals is essential to address these challenges and fully realize the benefits of AI in enhancing cardiovascular care for the elderly.",Antonazzo B; Vassiliou VS; Lauretti A; Biondi-Zoccai G 37692617,Unraveling the Ethical Enigma: Artificial Intelligence in Healthcare.,2023,Cureus,,,,"The integration of artificial intelligence (AI) into healthcare promises groundbreaking advancements in patient care, revolutionizing clinical diagnosis, predictive medicine, and decision-making. This transformative technology uses machine learning, natural language processing, and large language models (LLMs) to process and reason like human intelligence. OpenAI's ChatGPT, a sophisticated LLM, holds immense potential in medical practice, research, and education. However, as AI in healthcare gains momentum, it brings forth profound ethical challenges that demand careful consideration. This comprehensive review explores key ethical concerns in the domain, including privacy, transparency, trust, responsibility, bias, and data quality. Protecting patient privacy in data-driven healthcare is crucial, with potential implications for psychological well-being and data sharing. Strategies like homomorphic encryption (HE) and secure multiparty computation (SMPC) are vital to preserving confidentiality. Transparency and trustworthiness of AI systems are essential, particularly in high-risk decision-making scenarios. Explainable AI (XAI) emerges as a critical aspect, ensuring a clear understanding of AI-generated predictions. 
Cybersecurity becomes a pressing concern as AI's complexity creates vulnerabilities for potential breaches. Determining responsibility in AI-driven outcomes raises important questions, with debates on AI's moral agency and human accountability. Shifting from data ownership to data stewardship enables responsible data management in compliance with regulations. Addressing bias in healthcare data is crucial to avoid AI-driven inequities. Biases present in data collection and algorithm development can perpetuate healthcare disparities. A public-health approach is advocated to address inequalities and promote diversity in AI research and the workforce. Maintaining data quality is imperative in AI applications, with convolutional neural networks showing promise in multi-input/mixed data models, offering a comprehensive patient perspective. In this ever-evolving landscape, it is imperative to adopt a multidimensional approach involving policymakers, developers, healthcare practitioners, and patients to mitigate ethical concerns. By understanding and addressing these challenges, we can harness the full potential of AI in healthcare while ensuring ethical and equitable outcomes.",Jeyaraman M; Balaji S; Jeyaraman N; Yadav S 37904646,The importance of transparency: Declaring the use of generative artificial intelligence (AI) in academic writing.,2024,Journal of nursing scholarship : an official publication of Sigma Theta Tau International Honor Society of Nursing,,,,"The integration of generative artificial intelligence (AI) into academic research writing has revolutionized the field, offering powerful tools like ChatGPT and Bard to aid researchers in content generation and idea enhancement. We explore the current state of transparency regarding generative AI use in nursing academic research journals, emphasizing the need for explicitly declaring the use of generative AI by authors in the manuscript. 
Out of 125 nursing studies journals, 37.6% required explicit statements about generative AI use in their authors' guidelines. No significant differences in impact factors or journal categories were found between journals with and without such a requirement. A similar evaluation of medicine, general and internal journals showed a lower percentage (14.5%) including the information about generative AI usage. Declaring generative AI tool usage is crucial for maintaining transparency and credibility in academic writing. Additionally, extending the requirement for AI usage declarations to journal reviewers can enhance the quality of peer review and combat predatory journals in the academic publishing landscape. Our study highlights the need for active participation from nursing researchers in discussions surrounding standardization of generative AI declaration in academic research writing.",Tang A; Li KK; Kwok KO; Cao L; Luong S; Tam W 38508662,Performance of GPT-4 in Membership of the Royal College of Paediatrics and Child Health-style examination questions.,2024,BMJ paediatrics open,,,,The large language model (LLM) ChatGPT has been shown to have considerable utility across medicine and healthcare. This paper aims to explore the capabilities of GPT-4 (Generative Pre-trained Transformer 4) in answering Membership of the Royal College of Paediatrics and Child Health (MRCPCH) written paper-style questions. GPT-4 was subjected to four publicly available sample papers designed for those preparing to sit MRCPCH theory components. The model received no specialised training or reinforcement. The average score across all four papers was 78.1%. The model provided reasoning for its answers despite this not being required by the questions. 
This performance strengthens the case for incorporating LLMs into supporting roles for practising clinicians and medical education in paediatrics.,Armitage R 40256630,Large language models and rheumatology: are we there yet?,2025,Rheumatology advances in practice,,,,"The last 2 years have marked the beginning of a golden age for natural language processing in medicine. The arrival of large language models (LLMs) and multimodal models have raised new opportunities and challenges for research and clinical practice. In rheumatology, a specialty rich in data and requiring complex decision-making, the use of these tools may transform diagnostic procedures, improve patient interaction and simplify data management, leading to more personalized and efficient healthcare outcomes. The objective of this article is to present an overview of the status of LLMs in the field of rheumatology while discussing some of the challenges ahead in this area of great potential.",Benavent D; Madrid-Garcia A 39776586,Large language models facilitating modern molecular biology and novel drug development.,2024,Frontiers in pharmacology,,,,"The latest breakthroughs in information technology and biotechnology have catalyzed a revolutionary shift within the modern healthcare landscape, with notable impacts from artificial intelligence (AI) and deep learning (DL). Particularly noteworthy is the adept application of large language models (LLMs), which enable seamless and efficient communication between scientific researchers and AI systems. These models capitalize on neural network (NN) architectures that demonstrate proficiency in natural language processing, thereby enhancing interactions. This comprehensive review outlines the cutting-edge advancements in the application of LLMs within the pharmaceutical industry, particularly in drug development. 
It offers a detailed exploration of the core mechanisms that drive these models and zeroes in on the practical applications of several models that show great promise in this domain. Additionally, this review delves into the pivotal technical and ethical challenges that arise with the practical implementation of LLMs. There is an expectation that LLMs will assume a more pivotal role in the development of innovative drugs and will ultimately contribute to the accelerated development of revolutionary pharmaceuticals.",Liu XH; Lu ZH; Wang T; Liu F 38247934,ChatGPT in Occupational Medicine: A Comparative Study with Human Experts.,2024,"Bioengineering (Basel, Switzerland)",,,,"The objective of this study is to evaluate ChatGPT's accuracy and reliability in answering complex medical questions related to occupational health and explore the implications and limitations of AI in occupational health medicine. The study also provides recommendations for future research in this area and informs decision-makers about AI's impact on healthcare. A group of physicians was enlisted to create a dataset of questions and answers on Italian occupational medicine legislation. The physicians were divided into two teams, and each team member was assigned a different subject area. ChatGPT was used to generate answers for each question, with/without legislative context. The two teams then evaluated human and AI-generated answers blind, with each group reviewing the other group's work. Occupational physicians outperformed ChatGPT in generating accurate questions on a 5-point Likert score, while the answers provided by ChatGPT with access to legislative texts were comparable to those of professional doctors. 
Still, we found that users tend to prefer answers generated by humans, indicating that while ChatGPT is useful, users still value the opinions of occupational medicine professionals.",Padovan M; Cosci B; Petillo A; Nerli G; Porciatti F; Scarinci S; Carlucci F; Dell'Amico L; Meliani N; Necciari G; Lucisano VC; Marino R; Foddis R; Palla A 40122672,The role of large language models in the peer-review process: opportunities and challenges for medical journal reviewers and editors.,2025,Journal of educational evaluation for health professions,,,,"The peer review process ensures the integrity of scientific research. This is particularly important in the medical field, where research findings directly impact patient care. However, the rapid growth of publications has strained reviewers, causing delays and potential declines in quality. Generative artificial intelligence, especially large language models (LLMs) such as ChatGPT, may assist researchers with efficient, high-quality reviews. This review explores the integration of LLMs into peer review, highlighting their strengths in linguistic tasks and challenges in assessing scientific validity, particularly in clinical medicine. Key points for integration include initial screening, reviewer matching, feedback support, and language review. However, implementing LLMs for these purposes will necessitate addressing biases, privacy concerns, and data confidentiality. We recommend using LLMs as complementary tools under clear guidelines to support, not replace, human expertise in maintaining rigorous peer review standards.",Lee J; Lee J; Yoo JJ 38073698,Potential and limitations of ChatGPT and generative artificial intelligence in medical safety education.,2023,World journal of clinical cases,,,,"The primary objectives of medical safety education are to provide the public with essential knowledge about medications and to foster a scientific approach to drug usage. 
The era of using artificial intelligence to revolutionize medical safety education has already dawned, and ChatGPT and other generative artificial intelligence models have immense potential in this domain. Notably, they offer a wealth of knowledge, anonymity, continuous availability, and personalized services. However, the practical implementation of generative artificial intelligence models such as ChatGPT in medical safety education still faces several challenges, including concerns about the accuracy of information, legal responsibilities, and ethical obligations. Moving forward, it is crucial to intelligently upgrade ChatGPT by leveraging the strengths of existing medical practices. This task involves further integrating the model with real-life scenarios and proactively addressing ethical and security issues with the ultimate goal of providing the public with comprehensive, convenient, efficient, and personalized medical services.",Wang X; Liu XQ 40067594,Transforming plastic surgery: an innovative role of Chat GPT in plastic surgery practices.,2025,Updates in surgery,,,,"The proliferation of artificial intelligence (AI) in the healthcare sector is a present reality. The potential applications of Chat GPT in medicine are currently undergoing intense examination. This article seeks to examine the innovative capabilities and applications of Chat GPT in this field, highlighting its potential to revolutionize patient care and decision-making processes. PubMed, Scopus, Embase, Google Scholar, and Web of Science were searched by conducting a keyword search to locate studies examining the application of Chat GPT in the realm of plastic and reconstructive surgery. The titles, abstracts, and conclusions of the studies were scrutinized to select those most closely aligned with the focus of our study. 
This investigation involved a comprehensive review of 15 relevant articles from diverse geographical regions, predominantly comprising original studies alongside five review articles. This study illustrates the significant promise of integrating Chat GPT across diverse areas of plastic surgery, encompassing research, surgeon and patient education, and clinical practice. However, the incorporation of Chat GPT into plastic surgery necessitates diligent oversight and the formulation of explicit guidelines, and caution is necessary.",Mehraeen E; Attarian N; Tabari A; SeyedAlinaghi S 40330395,"A Practical Guide to the Utilization of ChatGPT in the Emergency Department: A Systematic Review of Current Applications, Future Directions, and Limitations.",2025,Cureus,,,,"The rapid development of artificial intelligence (AI) tools across various medical specialties highlights the potential for AI to transform medicine over the next 20 years. Despite this potential, the adoption of AI can feel incremental and disconnected from the daily practice of individual clinicians. For emergency department (ED) physicians practicing in 2025, recognizing and evaluating AI tools available for immediate integration into practice is essential. One such tool is ChatGPT (OpenAI, San Francisco, California, United States), a large language model (LLM) that is free, easily accessible via smartphones or computers, and widely used across industries. However, its usability in the ED setting remains poorly characterized. This review explores the current evidence surrounding ChatGPT 4's applications in various ED physician tasks, documenting its strengths and limitations. While ChatGPT demonstrates significant utility in language generation and administrative tasks, its potential for supporting more complex tasks in medical decision-making is emerging but not yet robust. The available evidence is limited and variable and lacks standardization, reflecting a field still in its early stages of development. 
Notably, the performance improvements observed between ChatGPT 3.5 and ChatGPT 4 suggest that future iterations, such as the anticipated release of ChatGPT 5, could significantly impact these findings. This review provides a comprehensive snapshot of the current state of evidence regarding ChatGPT's use in the ED, offering both an evaluation of its capabilities and a practical guide for its appropriate use by ED clinicians today.",Meyer NS; Meyer JW 37599212,[Applications and challenges of large language models in critical care medicine].,2023,Zhonghua yi xue za zhi,,,,"The rapid development of big data methods and technologies has provided more and more new ideas and methods for clinical diagnosis and treatment. The emergence of large language models (LLM) has made possible human-computer interactive dialogue and applications in complex medical scenarios. Critical care medicine is a process of continuous dynamic targeted treatment. The huge data generated in this process needs to be integrated and optimized through models for clinical application, interaction in teaching simulation, and assistance in scientific research. LLMs, represented by the generative pre-trained transformer ChatGPT, can already be applied to the diagnosis of severe diseases, the prediction of death risk, and the management of medical records. At the same time, ChatGPT's limitations have become apparent, including temporal and spatial constraints on its knowledge, hallucinations, and ethical and moral issues. 
In the future, it may well play a major role in the diagnosis and treatment of critically ill patients, but for now its conclusions should be judged carefully against deep clinical knowledge of critical care medicine.",Su LX; Weng L; Li WX; Long Y 37755165,A Bibliometric Analysis of the Rise of ChatGPT in Medical Research.,2023,"Medical sciences (Basel, Switzerland)",,,,"The rapid emergence of publicly accessible artificial intelligence platforms such as large language models (LLMs) has led to an equally rapid increase in articles exploring their potential benefits and risks. We performed a bibliometric analysis of ChatGPT literature in medicine and science to better understand publication trends and knowledge gaps. Following title, abstract, and keyword searches of PubMed, Embase, Scopus, and Web of Science databases for ChatGPT articles published in the medical field, articles were screened for inclusion and exclusion criteria. Data were extracted from included articles, with citation counts obtained from PubMed and journal metrics obtained from Clarivate Journal Citation Reports. After screening, 267 articles were included in the study, most of which were editorials or correspondence with an average of 7.5 +/- 18.4 citations per publication. Published articles on ChatGPT were authored largely in the United States, India, and China. The topics discussed included use and accuracy of ChatGPT in research, medical education, and patient counseling. Among non-surgical specialties, radiology published the most ChatGPT-related articles, while plastic surgery published the most articles among surgical specialties. The average citation number among the top 20 most-cited articles was 60.1 +/- 35.3. Among journals with the most ChatGPT-related publications, there were on average 10 +/- 3.7 publications. 
Our results suggest that managing the inevitable ethical and safety issues that arise with the implementation of LLMs will require further research exploring the capabilities and accuracy of ChatGPT, to generate policies guiding the adoption of artificial intelligence in medicine and science.",Barrington NM; Gupta N; Musmar B; Doyle D; Panico N; Godbole N; Reardon T; D'Amico RS 37987431,Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine.,2023,Clinics and practice,,,,"The rapid progress in artificial intelligence, machine learning, and natural language processing has led to increasingly sophisticated large language models (LLMs) for use in healthcare. This study assesses the performance of two LLMs, the GPT-3.5 and GPT-4 models, in passing the MIR medical examination for access to medical specialist training in Spain. Our objectives included gauging the model's overall performance, analyzing discrepancies across different medical specialties, discerning between theoretical and practical questions, estimating error proportions, and assessing the hypothetical severity of errors committed by a physician. MATERIAL AND METHODS: We studied the 2022 Spanish MIR examination results after excluding those questions requiring image evaluations or having acknowledged errors. The remaining 182 questions were presented to the LLM GPT-4 and GPT-3.5 in Spanish and English. Logistic regression models analyzed the relationships between question length, sequence, and performance. We also analyzed the 23 questions with images, using GPT-4's new image analysis capability. RESULTS: GPT-4 outperformed GPT-3.5, scoring 86.81% in Spanish (p < 0.001). English translations had a slightly enhanced performance. GPT-4 scored 26.1% of the questions with images in English. 
The results were worse when the questions were in Spanish, 13.0%, although the differences were not statistically significant (p = 0.250). Among medical specialties, GPT-4 achieved a 100% correct response rate in several areas, while the Pharmacology, Critical Care, and Infectious Diseases specialties showed lower performance. The error analysis revealed that while a 13.2% error rate existed, the gravest categories, such as ""error requiring intervention to sustain life"" and ""error resulting in death"", had a 0% rate. CONCLUSIONS: GPT-4 performs robustly on the Spanish MIR examination, with varying capabilities to discriminate knowledge across specialties. While the model's high success rate is commendable, understanding the error severity is critical, especially when considering AI's potential role in real-world medical practice and its implications for patient safety.",Guillen-Grima F; Guillen-Aguinaga S; Guillen-Aguinaga L; Alas-Brun R; Onambele L; Ortega W; Montejo R; Aguinaga-Ontoso E; Barach P; Aguinaga-Ontoso I 38832311,"Challenges and opportunities of artificial intelligence implementation within sports science and sports medicine teams.",2024,Frontiers in sports and active living,,,,"The rapid progress in the development of automation and artificial intelligence (AI) technologies, such as ChatGPT, represents a step-wise change in humans' interactions with technology as part of a broader complex, sociotechnical system. Based on historical parallels to the present moment, such changes are likely to bring forth structural shifts to the nature of work, where near and future technologies will occupy key roles as workers or assistants in sports science and sports medicine multidisciplinary teams (MDTs). This envisioned future may bring enormous benefits, as well as a raft of potential challenges. These challenges include the potential to remove many human roles and allocate them to semi- or fully-autonomous AI. 
Removing such roles and tasks from humans will make many current jobs and careers untenable, leaving a set of difficult and unrewarding tasks for the humans that remain. Paradoxically, replacing humans with technology increases system complexity and makes such systems more prone to failure. The automation and AI boom also brings substantial opportunities. Among them are automated sentiment analysis and Digital Twin technologies, which may reveal novel insights into athlete health and wellbeing and team tactical patterns, respectively. However, without due consideration of the interactions between humans and technology in the broader system of sport, adverse impacts are likely to be felt. Human and AI teamwork may require new ways of thinking.",Naughton M; Salmon PM; Compton HR; McLean S 38384621,Can DALL-E 3 Reliably Generate 12-Lead ECGs and Teaching Illustrations?,2024,Cureus,,,,"The recent integration of the latest image generation model DALL-E 3 into ChatGPT allows text prompts to easily generate the corresponding images, enabling multimodal output from ChatGPT. We explored the feasibility of DALL-E 3 for drawing a 12-lead ECG and found that it can draw rudimentary 12-lead electrocardiograms (ECG) displaying some of the parameters, although the details are not completely accurate. We also explored DALL-E 3's capacity to create vivid illustrations for teaching resuscitation-related medical knowledge. DALL-E 3 produced accurate CPR illustrations emphasizing proper hand placement and technique. For ECG principles, it produced creative heart-shaped waveforms tying ECGs to the heart. With further training, DALL-E 3 shows promise to expand easy-to-understand visual medical teaching materials and ECG simulations for different disease states. 
In conclusion, DALL-E 3 has the potential to generate realistic 12-lead ECGs and teaching schematics, but expert validation is still needed.",Zhu L; Mou W; Wu K; Zhang J; Luo P 38290759,Potential applications and implications of large language models in primary care.,2024,Family medicine and community health,,,,"The recent release of highly advanced generative artificial intelligence (AI) chatbots, including ChatGPT and Bard, which are powered by large language models (LLMs), has attracted growing mainstream interest over their diverse applications in clinical practice, including in health and healthcare. The potential applications of LLM-based programmes in the medical field range from assisting medical practitioners in improving their clinical decision-making and streamlining administrative paperwork to empowering patients to take charge of their own health. However, despite the broad range of benefits, the use of such AI tools also comes with several limitations and ethical concerns that warrant further consideration, encompassing issues related to privacy, data bias, and the accuracy and reliability of information generated by AI. The focus of prior research has primarily centred on the broad applications of LLMs in medicine. To the author's knowledge, this is the first article that consolidates current and pertinent literature on LLMs to examine their potential in primary care. 
The objectives of this paper are not only to summarise the potential benefits, risks and challenges of using LLMs in primary care, but also to offer insights into considerations that primary care clinicians should take into account when deciding to adopt and integrate such technologies into their clinical practice.",Andrew A 38028668,Overview of Chatbots with special emphasis on artificial intelligence-enabled ChatGPT in medical science.,2023,Frontiers in artificial intelligence,,,,"The release of ChatGPT has initiated new thinking about AI-based Chatbot and its application and has drawn huge public attention worldwide. Researchers and doctors have started thinking about the promise and application of AI-related large language models in medicine during the past few months. Here, the comprehensive review highlighted the overview of Chatbot and ChatGPT and their current role in medicine. Firstly, the general idea of Chatbots, their evolution, architecture, and medical use are discussed. Secondly, ChatGPT is discussed with special emphasis of its application in medicine, architecture and training methods, medical diagnosis and treatment, research ethical issues, and a comparison of ChatGPT with other NLP models are illustrated. The article also discussed the limitations and prospects of ChatGPT. In the future, these large language models and ChatGPT will have immense promise in healthcare. However, more research is needed in this direction.",Chakraborty C; Pal S; Bhattacharya M; Dash S; Lee SS 38524814,Using artificial intelligence for exercise prescription in personalised health promotion: A critical evaluation of OpenAI's GPT-4 model.,2024,Biology of sport,,,,"The rise of artificial intelligence (AI) applications in healthcare provides new possibilities for personalized health management. AI-based fitness applications are becoming more common, facilitating the opportunity for individualised exercise prescription. 
However, the use of AI carries the risk of inadequate expert supervision, and the efficacy and validity of such applications have not been thoroughly investigated, particularly in the context of diverse health conditions. The aim of the study was to critically assess the efficacy of exercise prescriptions generated by OpenAI's Generative Pre-Trained Transformer 4 (GPT-4) model for five example patient profiles with diverse health conditions and fitness goals. Our focus was to assess the model's ability to generate exercise prescriptions based on a singular, initial interaction, akin to a typical user experience. The evaluation was conducted by leading experts in the field of exercise prescription. Five distinct scenarios were formulated, each representing a hypothetical individual with a specific health condition and fitness objective. Upon receiving details of each individual, the GPT-4 model was tasked with generating a 30-day exercise program. These AI-derived exercise programs were subsequently subjected to a thorough evaluation by experts in exercise prescription. The evaluation encompassed adherence to established principles of frequency, intensity, time, and exercise type; integration of perceived exertion levels; consideration for medication intake and the respective medical condition; and the extent of program individualization tailored to each hypothetical profile. The AI model could create general safety-conscious exercise programs for various scenarios. However, the AI-generated exercise prescriptions lacked precision in addressing individual health conditions and goals, often prioritizing excessive safety over the effectiveness of training. The AI-based approach aimed to ensure patient improvement through gradual increases in training load and intensity, but the model's potential to fine-tune its recommendations through ongoing interaction was not fully satisfying. 
AI technologies, in their current state, can serve as supplemental tools in exercise prescription, particularly in enhancing accessibility for individuals unable to access, often costly, professional advice. However, AI technologies are not yet recommended as a substitute for personalized, progressive, and health condition-specific prescriptions provided by healthcare and fitness professionals. Further research is needed to explore more interactive use of AI models and integration of real-time physiological feedback.",Dergaa I; Saad HB; El Omri A; Glenn JM; Clark CCT; Washif JA; Guelmami N; Hammouda O; Al-Horani RA; Reynoso-Sanchez LF; Romdhani M; Paineiras-Domingos LL; Vancini RL; Taheri M; Mataruna-Dos-Santos LJ; Trabelsi K; Chtourou H; Zghibi M; Eken O; Swed S; Aissa MB; Shawki HH; El-Seedi HR; Mujika I; Seiler S; Zmijewski P; Pyne DB; Knechtle B; Asif IM; Drezner JA; Sandbakk O; Chamari K 37432530,Use of ChatGPT on Taiwan's Examination for Medical Doctors.,2024,Annals of biomedical engineering,,,,"The study evaluates the performance of OpenAI's GPT-3 model on answering medical exam questions from Staged Senior Professional and Technical Examinations Regulations for Medical Doctors in the field of internal medicine. The study used the official API to connect the questionnaire with the ChatGPT model, and the results showed that the AI model performed reasonably well, with the highest score of 8/13 in chest medicine. However, the overall performance of the AI model was limited, with only chest medicine scoring more than 60. ChatGPT scored relatively high in Chest medicine, Gastroenterology, and general medicine. 
One of the limitations of the study is the use of non-English text, which may affect the model's performance as the model is primarily trained on English text.",Kao YS; Chuang WK; Yang J 37864817,The Need for Artificial Intelligence Curriculum in Military Medical Education.,2024,Military medicine,,,,"The success of deep-learning algorithms in analyzing complex structured and unstructured multidimensional data has caused an exponential increase in the amount of research devoted to the applications of artificial intelligence (AI) in medicine in the past decade. Public release of large language models like ChatGPT in the past year has generated an unprecedented storm of excitement and rumors of machine intelligence finally reaching or even surpassing human capability in detecting meaningful signals in complex multivariate data. Such enthusiasm, however, is met with an equal degree of both skepticism and fear over the social, legal, and moral implications of such powerful technology with relatively few safeguards or regulations on its development. The question remains in medicine of how to harness the power of AI to improve patient outcomes by increasing the diagnostic accuracy and treatment precision provided by medical professionals. Military medicine, given its unique mission and resource constraints, can benefit immensely from such technology. However, reaping such benefits hinges on the ability of the rising generations of military medical professionals to understand AI algorithms and their applications. Additionally, they should strongly consider working with them as an adjunct decision-maker and view them as a colleague to access and harness relevant information as opposed to something to be feared. 
Ideas expressed in this commentary were formulated by a military medical student during a two-month research elective working on a multidisciplinary team of computer scientists and clinicians at the National Library of Medicine advancing the state of the art of AI in medicine. A motivation to incorporate AI in the Military Health System is provided, including examples of applications in military medicine. Rationale is then given for inclusion of AI in education starting in medical school as well as a prudent implementation of these algorithms in a clinical workflow during graduate medical education. Finally, barriers to implementation are addressed along with potential solutions. The end state is not that rising military physicians are technical experts in AI, but rather that they understand how they can leverage its rapidly evolving capabilities to prepare for a future where AI will have a significant role in clinical care. The overall goal is to develop trained clinicians who can leverage these technologies to improve the Military Health System.",Spirnak JR; Antani S 39467455,Drug repurposing in status epilepticus.,2024,Epilepsy & behavior : E&B,,,,"The treatment of status epilepticus (SE) has changed little in the last 20 years, largely because of the high risks and costs of new drug development for SE. Moreover, SE poses specific challenges to drug development, such as patient diversity, logistical hurdles, and the need for acute treatment strategies that differ from chronic seizure prevention. This has reduced the appetite of industry to develop new drugs in this area. Drug repurposing is an attractive approach to address this unmet need. It offers significant advantages, including reduced development time, lower costs, and higher success rates, compared to novel drug development. Here I demonstrate how novel methods integrating biological knowledge and computational methods can be applied to drug repurposing in status epilepticus. 
Biological approaches focus on addressing mechanisms underlying drug resistance in SE (using for example ketamine, tacrolimus and safinamide) and longer-term consequences (using for example omaveloxolone, celecoxib and losartan). Additionally, artificial intelligence platforms, such as ChatGPT, can rapidly generate promising drug lists, while in silico methods can analyze gene expression changes to predict molecular targets. Combining AI and in silico approaches has identified several candidate drugs, including metformin, sirolimus and riluzole, for SE treatment. Despite the promise of repurposing, challenges remain, such as intellectual property issues and regulatory barriers. Nonetheless, drug repurposing presents a viable solution to the high costs and slow progress of traditional drug development for SE. This paper is based on a presentation made at the 9th London-Innsbruck Colloquium on Status Epilepticus and Acute Seizures, in April 2024.",Walker MC 37779171,Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments.,2023,Scientific reports,,,,"The United States Medical Licensing Examination (USMLE) has been a subject of performance study for artificial intelligence (AI) models. However, their performance on questions involving USMLE soft skills remains unexplored. This study aimed to evaluate ChatGPT and GPT-4 on USMLE questions involving communication skills, ethics, empathy, and professionalism. We used 80 USMLE-style questions involving soft skills, taken from the USMLE website and the AMBOSS question bank. A follow-up query was used to assess the models' consistency. The performance of the AI models was compared to that of previous AMBOSS users. GPT-4 outperformed ChatGPT, correctly answering 90% compared to ChatGPT's 62.5%. GPT-4 showed more confidence, not revising any responses, while ChatGPT modified its original answers 82.5% of the time. The performance of GPT-4 was higher than that of AMBOSS's past users. 
Both AI models, notably GPT-4, showed capacity for empathy, indicating AI's potential to meet the complex interpersonal, ethical, and professional demands intrinsic to the practice of medicine.",Brin D; Sorin V; Vaid A; Soroush A; Glicksberg BS; Charney AW; Nadkarni G; Klang E 38247132,Potential applications of ChatGPT in obstetrics and gynecology in Korea: a review article.,2024,Obstetrics & gynecology science,,,,"The use of chatbot technology, particularly chat generative pre-trained transformer (ChatGPT) with an impressive 175 billion parameters, has garnered significant attention across various domains, including Obstetrics and Gynecology (OBGYN). This comprehensive review delves into the transformative potential of chatbots with a special focus on ChatGPT as a leading artificial intelligence (AI) technology. Moreover, ChatGPT harnesses the power of deep learning algorithms to generate responses that closely mimic human language, opening up myriad applications in medicine, research, and education. In the field of medicine, ChatGPT plays a pivotal role in diagnosis, treatment, and personalized patient education. Notably, the technology has demonstrated remarkable capabilities, surpassing human performance in OBGYN examinations, and delivering highly accurate diagnoses. However, challenges remain, including the need to verify the accuracy of the information and address the ethical considerations and limitations. In the wide scope of chatbot technology, AI systems play a vital role in healthcare processes, including documentation, diagnosis, research, and education. Although promising, the limitations and occasional inaccuracies require validation by healthcare professionals. This review also examined global chatbot adoption in healthcare, emphasizing the need for user awareness to ensure patient safety. 
Chatbot technology holds great promise in OBGYN and medicine, offering innovative solutions while necessitating responsible integration to ensure patient care and safety.",Lee Y; Kim SY 38242382,Comprehensive evaluation of molecule property prediction with ChatGPT.,2024,"Methods (San Diego, Calif.)",,,,"The versatility of ChatGPT in performing a diverse range of tasks has elicited considerable interest on its potential applications within professional fields. Taking drug discovery as a testbed, this paper provides a comprehensive evaluation of ChatGPT's ability on molecule property prediction. The study focuses on three aspects: 1) Effects of different prompt settings, where we investigate the impact of varying prompts on the prediction outcomes of ChatGPT; 2) Comprehensive evaluation on molecule property prediction, where we conduct a comprehensive evaluation on 53 ADMET-related endpoints; 3) Analysis of ChatGPT's potential and limitations, where we make comparisons with models tailored for molecule property prediction, thus gaining a more accurate understanding of ChatGPT's capabilities and limitations in this area. Through comprehensive evaluation, we find that 1) With appropriate prompt settings, ChatGPT can attain satisfactory prediction outcomes that are competitive with specialized models designed for those tasks. 2) Prompt settings significantly affect ChatGPT's performance. Among all prompt settings, the strategy of selecting examples in few-shot has the greatest impact on results. Scaffold sampling greatly outperforms random sampling. 3) The capacity of ChatGPT to accomplish high-precision predictions is significantly influenced by the quality of examples provided, which may constrain its practical applicability in real-world scenarios. 
This work highlights ChatGPT's potential and limitations on molecule property prediction, which we hope can inspire future design and evaluation of Large Language Models within scientific domains.",Cai X; Lai H; Wang X; Wang L; Liu W; Wang Y; Wang Z; Cao D; Zeng X 38562449,Bioinformatics and biomedical informatics with ChatGPT: Year one review.,2024,ArXiv,,,,"The year 2023 marked a significant surge in the exploration of applying large language model (LLM) chatbots, notably ChatGPT, across various disciplines. We surveyed the applications of ChatGPT in bioinformatics and biomedical informatics throughout the year, covering omics, genetics, biomedical text mining, drug discovery, biomedical image understanding, bioinformatics programming, and bioinformatics education. Our survey delineates the current strengths and limitations of this chatbot in bioinformatics and offers insights into potential avenues for future developments.",Wang J; Cheng Z; Yao Q; Liu L; Xu D; Hu G 39364207,Bioinformatics and biomedical informatics with ChatGPT: Year one review.,2024,"Quantitative biology (Beijing, China)",,,,"The year 2023 marked a significant surge in the exploration of applying large language model chatbots, notably Chat Generative Pre-trained Transformer (ChatGPT), across various disciplines. We surveyed the application of ChatGPT in bioinformatics and biomedical informatics throughout the year, covering omics, genetics, biomedical text mining, drug discovery, biomedical image understanding, bioinformatics programming, and bioinformatics education. 
Our survey delineates the current strengths and limitations of this chatbot in bioinformatics and offers insights into potential avenues for future developments.",Wang J; Cheng Z; Yao Q; Liu L; Xu D; Hu G 38384298,"Generative artificial intelligence in drug discovery: basic framework, recent advances, challenges, and opportunities.",2024,Frontiers in pharmacology,,,,"There are two main ways to discover or design small drug molecules. The first involves fine-tuning existing molecules or commercially successful drugs through quantitative structure-activity relationships and virtual screening. The second approach involves generating new molecules through de novo drug design or inverse quantitative structure-activity relationship. Both methods aim to get a drug molecule with the best pharmacokinetic and pharmacodynamic profiles. However, bringing a new drug to market is an expensive and time-consuming endeavor, with the average cost being estimated at around $2.5 billion. One of the biggest challenges is screening the vast number of potential drug candidates to find one that is both safe and effective. The development of artificial intelligence in recent years has been phenomenal, ushering in a revolution in many fields. The field of pharmaceutical sciences has also significantly benefited from multiple applications of artificial intelligence, especially drug discovery projects. Artificial intelligence models are finding use in molecular property prediction, molecule generation, virtual screening, synthesis planning, repurposing, among others. Lately, generative artificial intelligence has gained popularity across domains for its ability to generate entirely new data, such as images, sentences, audios, videos, novel chemical molecules, etc. Generative artificial intelligence has also delivered promising results in drug discovery and development. 
This review article delves into the fundamentals and framework of various generative artificial intelligence models in the context of drug discovery via de novo drug design approach. Various basic and advanced models have been discussed, along with their recent applications. The review also explores recent examples and advances in the generative artificial intelligence approach, as well as the challenges and ongoing efforts to fully harness the potential of generative artificial intelligence in generating novel drug molecules in a faster and more affordable manner. Some clinical-level assets generated from generative artificial intelligence have also been discussed in this review to show the ever-increasing application of artificial intelligence in drug discovery through commercial partnerships.",Gangwal A; Ansari A; Ahmad I; Azad AK; Kumarasamy V; Subramaniyan V; Wong LS 39278424,Editorial Commentary: The Scope of Medical Research Concerning ChatGPT Remains Limited by Lack of Originality.,2024,Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association,,,,"There is no shortage of literature surrounding ChatGPT and whether this large language model can provide accurate and clinically relevant information in response to simulated patient queries. Unfortunately, there is a shortage of literature addressing important considerations beyond these experimental and entertaining uses. Indeed, a trend for redundancy has emerged where most of the literature has applied ChatGPT to the same tasks while simply swapping the subject matter, resulting in a failure to expand the impact and reach of this potentially transformational artificial intelligence (AI) solution. 
Instead, research addressing pressing health care challenges and a renewed focus on novel use cases will allow for more meaningful research initiatives, product development, and tangible changes at both the system and point-of-care levels. Current target areas of interest in medicine that remain obstacles to patient care include prior authorization, administrative burden, documentation generation, medical triage and diagnosis, and patient communication efficiency. To advance this area of research toward such meaningful applications, a structured framework is necessary. Such frameworks should include problem identification; definition of key performance indicators; multidisciplinary and multi-institutional collaboration of those with domain expertise, including AI engineers and information technology specialists; policy and strategy development driven by executive-level personnel; institutional financial support and investment from key stakeholders for AI infrastructure and maintenance; and critical assessment of AI performance, bias, and equity.",Kunze KN 39583413,The Role of Artificial Intelligence in Diagnostic Radiology.,2024,Cureus,,,,"This article explores the significant impact of artificial intelligence (AI) on radiology through a comprehensive analysis of eight articles published between 2018 and 2024. With the rapid progress of modern science, the diagnostic methods in medicine are subject to change, which creates the need to consider and evaluate new diagnostic techniques such as artificial intelligence. In our study, we will evaluate the diagnostic accuracy of artificial intelligence and radiological image interpretation, as well as the pros and cons of its use and future development prospects in this field. In this article, we also consider the possibility of using GPT-4 for image analysis in radiology. 
Artificial intelligence is a revolutionary medical tool that can change diagnostic strategies to improve the quality of medical services.",Strubchevska O; Kozyk M; Kozyk A; Strubchevska K 37638266,Exploring the Potential and Limitations of Chat Generative Pre-trained Transformer (ChatGPT) in Generating Board-Style Dermatology Questions: A Qualitative Analysis.,2023,Cureus,,,,"This article investigates the limitations of Chat Generative Pre-trained Transformer (ChatGPT), a language model developed by OpenAI, as a study tool in dermatology. The study utilized ChatPDF, an application that integrates PDF files with ChatGPT, to generate American Board of Dermatology Applied Exam (ABD-AE)-style questions from continuing medical education articles from the Journal of the American Board of Dermatology. A qualitative analysis of the questions was conducted by two board-certified dermatologists, assessing accuracy, complexity, and clarity. Out of 40 questions generated, only 16 (40%) were deemed accurate and appropriate for ABD-AE study preparation. The remaining questions exhibited limitations, including low complexity, lack of clarity, and inaccuracies. The findings highlight the challenges faced by ChatGPT in understanding the domain-specific knowledge required in dermatology. Moreover, the model's inability to comprehend the context and generate high-quality distractor options, as well as the absence of image generation capabilities, further hinders its usefulness. 
The study emphasizes that while ChatGPT may aid in generating simple questions, it cannot replace the expertise of dermatologists and medical educators in developing high-quality, board-style questions that effectively evaluate candidates' knowledge and reasoning abilities.",Ayub I; Hamann D; Hamann CR; Davis MJ 37746684,Artificial Intelligence: its Future and Impact on Acute Medicine.,2023,Acute medicine,,,,"This commentary explores the potential impact of artificial intelligence (AI) in acute medicine, considering its possibilities and challenges. With its ability to simulate human intelligence, AI holds the promise for supporting timely decision-making and interventions in acute care. While AI has significantly contributed to improvements in various sectors, its implementation in healthcare remains limited. The development of AI tools tailored to acute medicine can improve clinical decision-making, and AI's role in streamlining administrative tasks, exemplified by ChatGPT, may offer immediate benefits. However, challenges include uniform data collection, privacy, bias, and preserving the doctor-patient relationship. Collaboration among AI researchers, healthcare professionals, and policymakers is crucial to harness the potential of AI in acute medicine and create a future where advanced technologies synergistically enhance human expertise.",Schinkel M; Paranjape K; Bhagirath SC; Nanayakkara P 38724772,"""Incorporating large language models into academic neurosurgery: embracing the new era"".",2024,Neurosurgical review,,,,"This correspondence examines how LLMs, such as ChatGPT, have an effect on academic neurosurgery. It emphasises the potential of LLMs in enhancing clinical decision-making, medical education, and surgical practice by providing real-time access to extensive medical literature and data analysis. 
Although this correspondence acknowledges the opportunities that come with the incorporation of LLMs, it also discusses challenges, such as data privacy, ethical considerations, and regulatory compliance. Additionally, recent studies have assessed the effectiveness of LLMs in perioperative patient communication and medical education, and stressed the need for cooperation between neurosurgeons, data scientists, and AI experts to address these challenges and fully exploit the potential of LLMs in improving patient care and outcomes in neurosurgery.",Aamir A; Hafsa H 37450276,Towards Precision Medicine in Spinal Surgery: Leveraging AI Technologies.,2024,Annals of biomedical engineering,,,,"This critique explores the implications of integrating artificial intelligence (AI) technology, specifically OpenAI's advanced language model GPT-4 and its interface, ChatGPT, into the field of spinal surgery. It examines the potential effects of algorithmic bias, unique challenges in surgical domains, access and equity issues, cost implications, global disparities in technology adoption, and the concept of technological determinism. It posits that biases present in AI training data may impact the quality and equity of healthcare outcomes. Challenges related to the unique nature of surgical procedures, including real-time decision-making, are also addressed. Concerns over access, equity, and cost implications underscore the potential for exacerbated healthcare disparities. Global disparities in technology adoption highlight the importance of global collaboration, technology transfer, and capacity building. Finally, the critique challenges the notion of technological determinism, emphasizing the continued importance of human judgement and patient-care provider relationship in healthcare. 
The critique calls for a comprehensive evaluation of AI technology integration in healthcare to ensure equitable and quality care.",Lawson McLean A 37758854,"Response Letter to ""Testing ChatGPT's Capabilities for Social Media Content Analysis"".",2024,Aesthetic plastic surgery,,,,"This editorial discusses the innovative application of ChatGPT in categorizing and analysing social media content, with a focus on aesthetic medical fields. It highlights the revolutionary capabilities of AI in enhancing efficiency and objectivity over traditional human-driven methods. Alongside the benefits, it also considers ethical concerns surrounding privacy, consent, and inherent biases within AI models. The article explores the complexity of categorization, the limitations in understanding human nuances, and the impact on human creativity, including specific applications such as SEO writing. It concludes by emphasizing the need for careful integration of AI in our interconnected world, balancing technological advancements with ethical considerations and a recognition of the unique attributes of human intellect. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Buzzaccarini G; Degliuomini RS; Borin M 39359332,OpenAI o1-Preview vs. ChatGPT in Healthcare: A New Frontier in Medical AI Reasoning.,2024,Cureus,,,,"This editorial explores the recent advancements in generative artificial intelligence with the newly-released OpenAI o1-Preview, comparing its capabilities to the traditional ChatGPT (GPT-4) model, particularly in the context of healthcare. 
While ChatGPT has shown many applications for general medical advice and patient interactions, OpenAI o1-Preview introduces new features with advanced reasoning skills using a chain of thought processes that could enable users to tackle more complex medical queries such as genetic disease discovery, multi-system or complex disease care, and medical research support. The article explores some of the new model's potential and other aspects that may affect its usage, like slower response times due to its extensive reasoning approach yet highlights its potential for reducing hallucinations and offering more accurate outputs for complex medical problems. Ethical challenges, data diversity, access equity, and transparency are also discussed, identifying key areas for future research, including optimizing the use of both models in tandem for healthcare applications. The editorial concludes by advocating for collaborative exploration of all large language models (LLMs), including the novel OpenAI o1-Preview, to fully utilize their transformative potential in medicine and healthcare delivery. This model, with its advanced reasoning capabilities, presents an opportunity to empower healthcare professionals, policymakers, and computer scientists to work together in transforming patient care, accelerating medical research, and enhancing healthcare outcomes. By optimizing the use of several LLM models in tandem, healthcare systems may enhance efficiency and precision, as well as mitigate previous LLM challenges, such as ethical concerns, access disparities, and technical limitations, steering to a new era of artificial intelligence (AI)-driven healthcare.",Temsah MH; Jamal A; Alhasan K; Temsah AA; Malki KH 37605022,Testing ChatGPT's Capabilities for Social Media Content Analysis.,2024,Aesthetic plastic surgery,,,,"This letter explores the potential of artificial intelligence models, specifically ChatGPT, for content analysis, namely for categorizing social media posts. 
The primary focus is on Twitter posts with the hashtag #plasticsurgery. Through integrating Python with the OpenAI API, the study provides a designed prompt to categorize tweet content. Looking forward, the utilization of AI in content analysis presents promising opportunities for advancing understanding of complex social phenomena. Level of Evidence V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine Ratings, please refer to Table of Contents or online Instructions to Authors http://www.springer.com/00266 .",Haman M; Skolnik M 37362116,The Urgent Need for Healthcare Workforce Upskilling and Ethical Considerations in the Era of AI-Assisted Medicine.,2023,Indian journal of otolaryngology and head and neck surgery : official publication of the Association of Otolaryngologists of India,,,,"This letter is in response to the article ""Enhancing India's Health Care during COVID Era: Role of Artificial Intelligence and Algorithms"". While the integration of AI has the potential to improve patient outcomes and reduce the workload of healthcare professionals, there is a need for significant training and upskilling of healthcare providers. There are ethical and privacy concerns related to the use of AI in healthcare, which must be accompanied by rigorous guidelines. One solution to the overburdened healthcare systems in India is the use of new language generation models like ChatGPT to assist healthcare workers in writing discharge summaries. 
By using these technologies responsibly, we can improve healthcare outcomes and alleviate the burden on overworked healthcare professionals.",Rao D 39755237,Introduction to Artificial Intelligence and Machine Learning in Pathology and Medicine: Generative and Nongenerative Artificial Intelligence Basics.,2025,"Modern pathology : an official journal of the United States and Canadian Academy of Pathology, Inc",,,,"This manuscript serves as an introduction to a comprehensive 7-part review article series on artificial intelligence (AI) and machine learning (ML) and their current and future influence within pathology and medicine. This introductory review provides a comprehensive grasp of this fast-expanding realm and its potential to transform medical diagnosis, workflow, research, and education. Fundamental terminology employed in AI-ML is covered using an extensive dictionary. The article also provides a broad overview of the main domains in the AI-ML field, encompassing both generative and nongenerative (traditional) AI, thereby serving as a primer to the other 6 review articles in this series that describe the details about statistics, regulations, bias, ethical dilemmas, and ML-Ops in AI-ML. The intent of these review articles is to better equip individuals who are or will be working in an AI-enabled health care system.",Rashidi HH; Pantanowitz J; Hanna MG; Tafti AP; Sanghani P; Buchinsky A; Fennell B; Deebajah M; Wheeler S; Pearce T; Abukhiran I; Robertson S; Palmer O; Gur M; Tran NK; Pantanowitz L 40424337,"""It is important to consult"" a linguist: Verb-Argument Constructions in ChatGPT and human experts' medical and financial advice.",2025,PloS one,,,,"This paper adopts a Usage-Based Construction Grammar perspective to compare human- and AI-generated language, focusing on Verb-Argument Constructions (VACs) as a lens for analysis. 
Specifically, we examine solicited advice texts in two domains-Finance and Medicine-produced by humans and ChatGPT across different GPT models (3.5, 4, and 4o) and interfaces (3.5 Web vs. 3.5 API). Our findings reveal broad consistency in the frequency and distribution of the most common VACs across human- and AI-generated texts, though ChatGPT exhibits a slightly higher reliance on the most frequent constructions. A closer examination of the verbs occupying these constructions uncovers significant differences in the meanings conveyed, with a notable growth away from human-like language production in macro level perspectives (e.g., length) and towards humanlike verb-VAC patterns with newer models. These results underscore the potential of VACs as a powerful tool for analyzing AI-generated language and tracking its evolution over time.",Casal JE; Stewart CM; Windsor AJ 36869927,Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios.,2023,Journal of medical systems,,,,"This paper aims to highlight the potential applications and limits of a large language model (LLM) in healthcare. ChatGPT is a recently developed LLM that was trained on a massive dataset of text for dialogue with users. Although AI-based language models like ChatGPT have demonstrated impressive capabilities, it is uncertain how well they will perform in real-world scenarios, particularly in fields such as medicine where high-level and complex thinking is necessary. Furthermore, while the use of ChatGPT in writing scientific articles and other scientific outputs may have potential benefits, important ethical concerns must also be addressed. Consequently, we investigated the feasibility of ChatGPT in clinical and research scenarios: (1) support of the clinical practice, (2) scientific production, (3) misuse in medicine and research, and (4) reasoning about public health topics. 
Results indicated that it is important to recognize and promote education on the appropriate use and potential pitfalls of AI-based LLMs in medicine.",Cascella M; Montomoli J; Bellini V; Bignami E 36841840,Can artificial intelligence help for scientific writing?,2023,"Critical care (London, England)",,,,"This paper discusses the use of Artificial Intelligence Chatbot in scientific writing. ChatGPT is a type of chatbot, developed by OpenAI, that uses the Generative Pre-trained Transformer (GPT) language model to understand and respond to natural language inputs. AI chatbot and ChatGPT in particular appear to be useful tools in scientific writing, assisting researchers and scientists in organizing material, generating an initial draft and/or in proofreading. There is no publication in the field of critical care medicine prepared using this approach; however, this will be a possibility in the next future. ChatGPT work should not be used as a replacement for human judgment and the output should always be reviewed by experts before being used in any critical decision-making or application. Moreover, several ethical issues arise about using these tools, such as the risk of plagiarism and inaccuracies, as well as a potential imbalance in its accessibility between high- and low-income countries, if the software becomes paying. For this reason, a consensus on how to regulate the use of chatbots in scientific writing will soon be required.",Salvagno M; Taccone FS; Gerli AG 37215063,"ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations.",2023,Frontiers in artificial intelligence,,,,"This paper presents an analysis of the advantages, limitations, ethical considerations, future prospects, and practical applications of ChatGPT and artificial intelligence (AI) in the healthcare and medical domains. 
ChatGPT is an advanced language model that uses deep learning techniques to produce human-like responses to natural language inputs. It is part of the family of generative pre-training transformer (GPT) models developed by OpenAI and is currently one of the largest publicly available language models. ChatGPT is capable of capturing the nuances and intricacies of human language, allowing it to generate appropriate and contextually relevant responses across a broad spectrum of prompts. The potential applications of ChatGPT in the medical field range from identifying potential research topics to assisting professionals in clinical and laboratory diagnosis. Additionally, it can be used to help medical students, doctors, nurses, and all members of the healthcare fraternity to know about updates and new developments in their respective fields. The development of virtual assistants to aid patients in managing their health is another important application of ChatGPT in medicine. Despite its potential applications, the use of ChatGPT and other AI tools in medical writing also poses ethical and legal concerns. These include possible infringement of copyright laws, medico-legal complications, and the need for transparency in AI-generated content. In conclusion, ChatGPT has several potential applications in the medical and healthcare fields. However, these applications come with several limitations and ethical considerations which are presented in detail along with future prospects in medicine and healthcare.",Dave T; Athaluri SA; Singh S 37325497,Large language models and the emergence phenomena.,2023,European journal of radiology open,,,,"This perspective explores the potential of emergence phenomena in large language models (LLMs) to transform data management and analysis in radiology. 
We provide a concise explanation of LLMs, define the concept of emergence in machine learning, offer examples of potential applications within the radiology field, and discuss risks and limitations. Our goal is to encourage radiologists to recognize and prepare for the impact this technology may have on radiology and medicine in the near future.",Sorin V; Klang E 39689760,Generative Artificial Intelligence in Pathology and Medicine: A Deeper Dive.,2025,"Modern pathology : an official journal of the United States and Canadian Academy of Pathology, Inc",,,,"This review article builds upon the introductory piece in our 7-part series, delving deeper into the transformative potential of generative artificial intelligence (Gen AI) in pathology and medicine. The article explores the applications of Gen AI models in pathology and medicine, including the use of custom chatbots for diagnostic report generation, synthetic image synthesis for training new models, data set augmentation, hypothetical scenario generation for educational purposes, and the use of multimodal along with multiagent models. This article also provides an overview of the common categories within Gen AI models, discussing open-source and closed-source models, as well as specific examples of popular models such as GPT-4, Llama, Mistral, DALL-E, Stable Diffusion, and their associated frameworks (eg, transformers, generative adversarial networks, diffusion-based neural networks), along with their limitations and challenges, especially within the medical domain. We also review common libraries and tools that are currently deemed necessary to build and integrate such models. 
Finally, we look to the future, discussing the potential impact of Gen AI on health care, including benefits, challenges, and concerns related to privacy, bias, ethics, application programming interface costs, and security measures.",Rashidi HH; Pantanowitz J; Chamanzar A; Fennell B; Wang Y; Gullapalli RR; Tafti A; Deebajah M; Albahra S; Glassy E; Hanna MG; Pantanowitz L 37792344,Can ChatGPT Provide Quality Information on Integrative Oncology? A Brief Report.,2024,Journal of integrative and complementary medicine,,,,"This short report evaluated the accuracy and quality of information provided by ChatGPT regarding the use of complementary and integrative medicine for cancer. Using the QUality Evaluation Scoring Tool, a panel of 12 reviewers assessed ChatGPT's responses to 8 questions. The study found that ChatGPT provided moderate-quality responses that were relatively unbiased and not misleading. However, the chatbot's inability to reference specific scientific studies was a significant limitation. Patients with cancer should not rely on ChatGPT for clinical advice until further systematic validation. Future studies should examine how patients perceive ChatGPT's information and its impact on communication with health care professionals.",Lam CS; Hua R; Koon HK; Zhou KR; Lam TTN; Lee CP; Lin WL; Wong CL; Lau YM; Loong HH; Chung VC; Cheung YT 38761230,Publication Trends and Hot Spots of ChatGPT's Application in the Medicine.,2024,Journal of medical systems,,,,"This study aimed to analyze the current landscape of ChatGPT application in the medical field, assessing the current collaboration patterns and research topic hotspots to understand the impact and trends. By conducting a search in the Web of Science, we collected literature related to the applications of ChatGPT in medicine, covering the period from January 1, 2000 up to January 16, 2024. 
Bibliometric analyses were performed using CiteSpace (V6.2., Drexel University, PA, USA) and Microsoft Excel (Microsoft Corp., WA, USA) to map the collaboration among countries/regions, the distribution of institutions and authors, and clustering of keywords. A total of 574 eligible articles were included, with 97.74% published in 2023. These articles span various disciplines, particularly in Health Care Sciences Services, with extensive international collaboration involving 73 countries. In terms of countries/regions studied, USA, India, and China led in the number of publications. The USA not only published nearly half of the total number of papers but also exhibited the highest collaborative capability. Regarding the co-occurrence of institutions and scholars, the National University of Singapore and Harvard University held significant influence in the cooperation network, with the top three authors in terms of publications being Wiwanitkit V (10 articles), Seth I (9 articles), Klang E (7 articles), and Kleebayoon A (7 articles). Through keyword clustering, the study identified 9 research theme clusters, among which ""digital health"" was not only the largest in scale but also had the most citations. The study highlights ChatGPT's cross-disciplinary nature and collaborative research in medicine, showcasing its growth potential, particularly in digital health and clinical decision support. Future exploration should examine the socio-economic and cultural impacts of this trend, along with ChatGPT's specific technical uses in medical practice.",Li ZQ; Wang XF; Liu JP 39254919,Assessing knowledge about medical physics in language-generative AI with large language model: using the medical physicist exam.,2024,Radiological physics and technology,,,,"This study aimed to evaluate the performance for answering the Japanese medical physicist examination and providing the benchmark of knowledge about medical physics in language-generative AI with large language model. 
We used questions from Japan's 2018, 2019, 2020, 2021 and 2022 medical physicist board examinations, which covered various question types, including multiple-choice questions, and mainly focused on general medicine and medical physics. ChatGPT-3.5 and ChatGPT-4.0 (OpenAI) were used. We compared the AI-based answers with the correct ones. The average accuracy rates were 42.2 +/- 2.5% (ChatGPT-3.5) and 72.7 +/- 2.6% (ChatGPT-4), showing that ChatGPT-4 was more accurate than ChatGPT-3.5 [all categories (except for radiation-related laws and recommendations/medical ethics): p value < 0.05]. Even with the ChatGPT model with higher accuracy, the accuracy rates were less than 60% in two categories: radiation metrology (55.6%), and radiation-related laws and recommendations/medical ethics (40.0%). These data provide the benchmark for knowledge about medical physics in ChatGPT and can be utilized as basic data for the development of various medical physics tools using ChatGPT (e.g., radiation therapy support tools with Japanese input).",Kadoya N; Arai K; Tanaka S; Kimura Y; Tozuka R; Yasui K; Hayashi N; Katsuta Y; Takahashi H; Inoue K; Jingu K 38691404,Integrating Text and Image Analysis: Exploring GPT-4V's Capabilities in Advanced Radiological Applications Across Subspecialties.,2024,Journal of medical Internet research,,,,This study demonstrates that GPT-4V outperforms GPT-4 across radiology subspecialties in analyzing 207 cases with 1312 images from the Radiological Society of North America Case Collection.,Busch F; Han T; Makowski MR; Truhn D; Bressem KK; Adams L 39451178,An investigative analysis - ChatGPT's capability to excel in the Polish speciality exam in pathology.,2024,Polish journal of pathology : official journal of the Polish Society of Pathologists,,,,"This study evaluates the effectiveness of the ChatGPT-3.5 language model in providing correct answers to pathomorphology questions as required by the State Speciality Examination (PES). 
Artificial intelligence (AI) in medicine is generating increasing interest, but its potential needs thorough evaluation. A set of 119 exam questions by type and subtype was used, which were posed to the ChatGPT-3.5 model. Performance was analysed with regard to the success rate in different question categories and subtypes. ChatGPT-3.5 achieved a performance of 45.38%, which is significantly below the minimum PES pass threshold. The results achieved varied by question type and subtype, with better results in questions requiring ""comprehension and critical thinking"" than ""memory"". The analysis shows that, although ChatGPT-3.5 can be a useful teaching tool, its performance in providing correct answers to pathomorphology questions is significantly lower than that of human respondents. This conclusion highlights the need to further improve the AI model, taking into account the specificities of the medical field. Artificial intelligence can be helpful, but it cannot fully replace the experience and knowledge of specialists.",Bielowka M; Kufel J; Rojek M; Kaczynska D; Czogalik L; Mitrega A; Bartnikowska W; Kondol D; Palkij K; Mielcarska S 39550662,[Not Available].,2024,Recenti progressi in medicina,,,,"This study explores the potential use of ChatGPT, an AI-based language model, in assessing herbal-drug interactions (HDI) to enhance clinical decision-making. HDIs can pose significant health risks by reducing drug efficacy or causing unwanted side effects. Clinical pharmacists play a key role in identifying these HDIs, and currently, there are limited tools available for checking drug interactions. The research focuses on a case study of a rectal adenocarcinoma patient treated with capecitabine and 26 supplements, which contain a total of 80 herbal substances. ChatGPT 3.5 was asked three questions regarding potential HDIs: ""Are there possible HDIs?"", ""What is the pharmacokinetic mechanism?"", and ""What is the bibliographic source of the interaction?"". 
The results were reviewed by an oncology clinical pharmacist and compared to existing databases and independent bibliographic research. The findings highlight ChatGPT's advantage in processing large amounts of data quickly, with 16% of interactions classified as ""unlikely"", confirmed by the pharmacist. However, 73% of the suggested mechanisms were false positives, and 4% were categorized as ""hallucinations"". Additionally, most of the bibliographic sources provided by ChatGPT were outdated or unavailable. While ChatGPT proves useful for initial HDI screening, its limitations include outdated data (last updated in January 2022), lack of access to private databases, and occasional inaccuracies. Further applications of AI in this area are recommended, though expert validation remains essential in the clinical decision-making process.",Fiordelisi M; Masucci S; Bianco A; Bellero M; Toma D; Campo N; Zichi C; Marino D; Sperti E; Valabrega G; Cena C; Fazzina G; Gasco A 38371109,"A Comparative Analysis of AI Models in Complex Medical Decision-Making Scenarios: Evaluating ChatGPT, Claude AI, Bard, and Perplexity.",2024,Cureus,,,,"This study rigorously evaluates the performance of four artificial intelligence (AI) language models - ChatGPT, Claude AI, Google Bard, and Perplexity AI - across four key metrics: accuracy, relevance, clarity, and completeness. We used a strong mix of research methods, getting opinions from 14 scenarios. This helped us make sure our findings were accurate and dependable. The study showed that Claude AI performs better than others because it gives complete responses. Its average score was 3.64 for relevance and 3.43 for completeness compared to other AI tools. ChatGPT always did well, and Google Bard had unclear responses, which varied greatly, making it difficult to understand it, so there was no consistency in Google Bard. These results give important information about what AI language models are doing well or not for medical suggestions. 
These findings can guide better use of these models and inform future AI-driven developments, and the study shows how AI capabilities align with complex medical scenarios.",Uppalapati VK; Nag DS 38746668,The application of large language models in medicine: A scoping review.,2024,iScience,,,,"This study systematically reviewed the application of large language models (LLMs) in medicine, analyzing 550 selected studies from a vast literature search. LLMs like ChatGPT transformed healthcare by enhancing diagnostics, medical writing, education, and project management. They assisted in drafting medical documents, creating training simulations, and streamlining research processes. Despite their growing utility in assisted diagnosis and improving doctor-patient communication, challenges persisted, including limitations in contextual understanding and the risk of over-reliance. The surge in LLM-related research indicated a focus on medical writing, diagnostics, and patient communication, but highlighted the need for careful integration, considering validation, ethical concerns, and the balance with traditional medical practice. Future research directions suggested a focus on multimodal LLMs, deeper algorithmic understanding, and ensuring responsible, effective use in healthcare.",Meng X; Yan X; Zhang K; Liu D; Cui X; Yang Y; Zhang M; Cao C; Wang J; Wang X; Gao J; Wang YG; Ji JM; Qiu Z; Li M; Qian C; Guo T; Ma S; Wang Z; Guo Z; Lei Y; Shao C; Wang W; Fan H; Tang YD 38656706,Evaluation of ChatGPT and Gemini large language models for pharmacometrics with NONMEM.,2024,Journal of pharmacokinetics and pharmacodynamics,,,,"To assess ChatGPT 4.0 (ChatGPT) and Gemini Ultra 1.0 (Gemini) large language models on NONMEM coding tasks relevant to pharmacometrics and clinical pharmacology. ChatGPT and Gemini were assessed on tasks mimicking real-world applications of NONMEM. The tasks ranged from providing a curriculum for learning NONMEM and an overview of NONMEM code structure to generating code. 
Prompts in lay language to elicit NONMEM code for a linear pharmacokinetic (PK) model with oral administration and a more complex model with two parallel first-order absorption mechanisms were investigated. Reproducibility and the impact of ""temperature"" hyperparameter settings were assessed. The code was reviewed by two NONMEM experts. ChatGPT and Gemini provided NONMEM curriculum structures combining foundational knowledge with advanced concepts (e.g., covariate modeling and Bayesian approaches) and practical skills including NONMEM code structure and syntax. ChatGPT provided an informative summary of the NONMEM control stream structure and outlined the key NONMEM Translator (NM-TRAN) records needed. ChatGPT and Gemini were able to generate code blocks for the NONMEM control stream from the lay language prompts for the two coding tasks. The control streams contained focal structural and syntax errors that required revision before they could be executed without errors and warnings. The code output from ChatGPT and Gemini was not reproducible, and varying the temperature hyperparameter did not reduce the errors and omissions substantively. Large language models may be useful in pharmacometrics for efficiently generating an initial coding template for modeling projects. However, the output can contain errors and omissions that require correction.",Shin E; Yu Y; Bies RR; Ramanathan M 39130248,Examining the Performance of ChatGPT 3.5 and Microsoft Copilot in Otolaryngology: A Comparative Study with Otolaryngologists' Evaluation.,2024,Indian journal of otolaryngology and head and neck surgery : official publication of the Association of Otolaryngologists of India,,,,"To evaluate the response capabilities, in a public healthcare system otolaryngology job competition examination, of ChatGPT 3.5 and an internet-connected GPT-4 engine (Microsoft Copilot) with the real scores of otolaryngology specialists as the control group. 
In September 2023, 135 questions divided into theoretical and practical parts were input into ChatGPT 3.5 and an internet-connected GPT-4. The accuracy of AI responses was compared with the official results from otolaryngologists who took the exam, and statistical analysis was conducted using Stata 14.2. Copilot (GPT-4) outperformed ChatGPT 3.5. Copilot achieved a score of 88.5 points, while ChatGPT scored 60 points. Both AIs had discrepancies in their incorrect answers. Despite ChatGPT's proficiency, Copilot displayed superior performance, ranking as the second-best score among the 108 otolaryngologists who took the exam, while ChatGPT was placed 83rd. A chat powered by GPT-4 with internet access (Copilot) demonstrates superior performance in responding to multiple-choice medical questions compared to ChatGPT 3.5.",Mayo-Yanez M; Lechien JR; Maria-Saibene A; Vaira LA; Maniaci A; Chiesa-Estomba CM 39774168,Healthcare professionals and the public sentiment analysis of ChatGPT in clinical practice.,2025,Scientific reports,,,,"To explore the attitudes of healthcare professionals and the public on applying ChatGPT in clinical practice. The successful application of ChatGPT in clinical practice depends on technical performance and critically on the attitudes and perceptions of both healthcare professionals and the public. This study has a qualitative design based on artificial intelligence. This study was divided into five steps: data collection, data cleaning, validation of relevance, sentiment analysis, and content analysis using the K-means algorithm. This study comprised 3130 comments amounting to 1,593,650 words. The dictionary method identified positive and negative emotions such as anger, disgust, fear, sadness, surprise, good, and happy emotions. Healthcare professionals prioritized ChatGPT's efficiency but raised ethical and accountability concerns, while the public valued its accessibility and emotional support but expressed worries about privacy and misinformation. 
Bridging these perspectives by improving reliability, safeguarding privacy, and clearly defining ChatGPT's role is essential for its practical and ethical integration into clinical practice.",Lu L; Zhu Y; Yang J; Yang Y; Ye J; Ai S; Zhou Q 37952004,Evaluation of prompt engineering strategies for pharmacokinetic data analysis with the ChatGPT large language model.,2024,Journal of pharmacokinetics and pharmacodynamics,,,,"To systematically assess the ChatGPT large language model on diverse tasks relevant to pharmacokinetic data analysis. ChatGPT was evaluated with prototypical tasks related to report writing, code generation, non-compartmental analysis, and pharmacokinetic word problems. The writing task consisted of writing an introduction for this paper from a draft title. The coding tasks consisted of generating R code for semi-logarithmic graphing of concentration-time profiles and calculating area under the curve and area under the moment curve from time zero to infinity. Pharmacokinetics word problems on single intravenous, extravascular bolus, and multiple dosing were taken from a pharmacokinetics textbook. Chain-of-thought and problem separation were assessed as prompt engineering strategies when errors occurred. ChatGPT showed satisfactory performance on the report writing, code generation tasks and provided accurate information on the principles and methods underlying pharmacokinetic data analysis. However, ChatGPT had high error rates in numerical calculations involving exponential functions. The outputs generated by ChatGPT were not reproducible: the precise content of the output was variable albeit not necessarily erroneous for different instances of the same prompt. Incorporation of prompt engineering strategies reduced but did not eliminate errors in numerical calculations. ChatGPT has the potential to become a powerful productivity tool for writing, knowledge encapsulation, and coding tasks in pharmacokinetic data analysis. 
The poor accuracy of ChatGPT in numerical calculations requires resolution before it can be reliably used for PK and pharmacometrics data analysis.",Shin E; Ramanathan M 38903274,Causality Assessment of Adverse Drug Reaction Toxic Epidermal Necrolysis With the Aid of ChatGPT: A Case Report.,2024,Cureus,,,,"Toxic epidermal necrolysis (TEN) is a severe and potentially fatal adverse drug reaction. This case report presents a 19-year-old male with pulmonary tuberculosis undergoing anti-tubercular therapy who developed TEN. The patient had multiple comorbidities including type 1 diabetes mellitus and multisystem atrophy. ChatGPT was utilized alongside conventional methods to assess causality. While conventional scoring systems estimated mortality at 58.3% (SCORTEN) and 12.3% (ABCD-10), ChatGPT yielded divergent scores. Causality assessment using WHO-Uppsala Monitoring Centre (UMC) and Naranjo's scale indicated rifampicin and isoniazid as probable causative agents. However, ChatGPT provided ambiguous results. The study underscores the potential of AI in pharmacovigilance but emphasizes caution due to discrepancies observed. Collaborative utilization of artificial intelligence (AI) with clinical judgment is advocated to enhance diagnostic accuracy and treatment decisions in adverse drug reactions. This case highlights the importance of integrating AI into drug safety systems while acknowledging its limitations to ensure optimal patient care.",Pandya S; Patel C; Sojitra B; Karamata H 37128519,"A Case of Delusional Disorder With Abuse of Isoniazid, Rifampicin, Pyrazinamide, and Ethambutol, the First-Line Anti-tuberculosis Therapy Drugs in India.",2023,Cureus,,,,"Tuberculosis (TB) and mental illnesses frequently coexist and are both extremely common worldwide. Through the National Program for Elimination of Tuberculosis (NTEP), anti-tuberculosis therapy (ATT) medications are used to treat tuberculosis in India. 
We report the case of a 45-year-old patient from the state of Andhra Pradesh, India, with comorbid delusional disorder leading to daily ATT drug consumption for the past 20 years. This unusual presentation demonstrates that abuse of a Schedule ""H"" substance like ATT is also conceivable. To stop ""Off-label"" purchases, strict measures must be taken. Before beginning ATT, evaluating the patient's mental health may be a wise move.",Sathiyamoorthi S; Pentapati SSK; Vullanki SS; Avula VCR; Aravindakshan R 37344394,The potential of 'Segment Anything' (SAM) for universal intelligent ultrasound image guidance.,2023,Bioscience trends,,,,"Ultrasound image guidance is a method often used to help provide care, and it relies on accurate perception of information, and particularly tissue recognition, to guide medical procedures. It is widely used in various scenarios that are often complex. Recent breakthroughs in large models, such as ChatGPT for natural language processing and Segment Anything Model (SAM) for image segmentation, have revolutionized interaction with information. These large models exhibit a revolutionized understanding of basic information, holding promise for medicine, including the potential for universal autonomous ultrasound image guidance. The current study evaluated the performance of SAM on commonly used ultrasound images and discusses SAM's potential contribution to an intelligent image-guided framework, with a specific focus on autonomous and universal ultrasound image guidance. 
Results indicate that SAM performs well in ultrasound image segmentation and has the potential to enable universal intelligent ultrasound image guidance.",Ning G; Liang H; Jiang Z; Zhang H; Liao H 38353440,[ChatGPT in clinical practice: prospects and challenges].,2024,Revue medicale suisse,,,,"Virtually unknown to the greater public before November 2022, ChatGPT was made available in open access in Autumn 2022, driving the perspective of artificial intelligence integration to the forefront of daily life. The field of medicine hasn't been left aside, and sparks as much interest as it does questions. Although this tool has considerable potential for use in clinical practice, it, like others, has limitations that need to be clearly understood to avoid misuse. In addition, the legal framework and issues of data confidentiality are currently poorly defined, and clinicians will need to keep a close eye on legislative developments in this area.",Roustan D; Galland-Decker C; Marinoni C; Bastardot F 37885558,Using Artificial Intelligence to Assess the Teratogenic Risk of Vitamin A Supplements.,2023,Cureus,,,,"Vitamin A in high doses has been found to be highly teratogenic, leading to severe fetal abnormalities if exposure occurs during pregnancy. Hence, prescription vitamin A acne medications like isotretinoin are highly regulated via programs such as iPledge, which intend to avert fetal exposure to isotretinoin and to educate healthcare providers, pharmacists, and patients about the significant risks associated with isotretinoin and its appropriate usage conditions. However, over-the-counter (OTC) vitamin A supplements are not subject to these requirements, and calculating the vitamin A content of these supplements can be difficult due to the lack of Food and Drug Administration (FDA) regulations and inconsistencies in labeling. 
If the necessary information is provided, ChatGPT, a generative artificial intelligence (AI) tool, can help the general public calculate the vitamin A content of supplements. Nonetheless, supplement manufacturers do not always provide the data necessary for these calculations.",Holla S; Zamil DH; Paidisetty PS; Wang LK; Katta R 38977450,"Response to ""Letter to the Editor-Exploring the Unknown: Evaluating ChatGPT's Performance in Uncovering Novel Aspects of Plastic Surgery and Identifying Areas for Future Innovation"".",2024,Aesthetic plastic surgery,,,,"We appreciate Dr. Qi and Dr. Niu for their insightful comments on our study, ""Exploring the Unknown: Evaluating ChatGPT's Performance in Uncovering Novel Aspects of Plastic Surgery and Identifying Areas for Future Innovation."" Their observations underscore significant considerations in the application of artificial intelligence (AI) in plastic surgery. We agree with their concern about potential biases in ChatGPT's responses. The AI's frequent attribution of the title ""parent of plastic surgery"" to Sir Harold Delf Gillies, despite gender-neutral terminology, highlights underlying biases from training data. These biases often reflect historical texts and contemporary writings. Addressing them requires refining training datasets for balanced representation and developing algorithms that adjust dynamically to diverse inputs. The authors also question the criteria ChatGPT uses to identify key contributions to plastic surgery. The AI's focus on microsurgery, minimally invasive techniques, and tissue engineering, while significant, may prioritize keyword prevalence over a holistic evaluation. Enhancing ChatGPT's capabilities through targeted training and input from subject matter experts could improve the AI's ability to generate more balanced outputs. The identified bias favoring reconstructive over cosmetic procedures is another critical point. 
While reconstructive advancements are transformative, cosmetic surgery also has significant innovations. Ensuring ChatGPT presents a balanced view of both reconstructive and cosmetic advancements is essential. This can be achieved by diversifying training data and calibrating the AI to give equitable weight to different subspecialties within plastic surgery. AI models like ChatGPT are proficient in processing and generating information but lack the human elements of creativity, intuition, and emotional depth critical for groundbreaking innovations. AI should complement, not replace, the expert judgment and innovative thinking of skilled plastic surgeons. Ensuring the accuracy of AI-generated responses is crucial. Clinicians must verify AI-generated information against established medical literature and clinical guidelines to maintain accuracy in medical practice. Continuous feedback and improvement mechanisms are vital to enhance AI's clinical utility. The improvement of AI in plastic surgery will be driven by active involvement from surgeons, providing comprehensive and balanced data for training to ensure AI systems evolve to support and enhance clinical practice effectively. Level of Evidence V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Seth I; Lim B; Rozen WM 37709536,ChatGPT: Can You Prepare My Patients for [(18)F]FDG PET/CT and Explain My Reports?,2023,"Journal of nuclear medicine : official publication, Society of Nuclear Medicine",,,,"We evaluated whether the artificial intelligence chatbot ChatGPT can adequately answer patient questions related to [(18)F]FDG PET/CT in common clinical indications before and after scanning. Methods: Thirteen questions regarding [(18)F]FDG PET/CT were submitted to ChatGPT. 
ChatGPT was also asked to explain 6 PET/CT reports (lung cancer, Hodgkin lymphoma) and answer 6 follow-up questions (e.g., on tumor stage or recommended treatment). To be rated ""useful"" or ""appropriate,"" a response had to be adequate by the standards of the nuclear medicine staff. Inconsistency was assessed by regenerating responses. Results: Responses were rated ""appropriate"" for 92% of 25 tasks and ""useful"" for 96%. Considerable inconsistencies were found between regenerated responses for 16% of tasks. Responses to 83% of sensitive questions (e.g., staging/treatment options) were rated ""empathetic."" Conclusion: ChatGPT might adequately substitute for advice given to patients by nuclear medicine staff in the investigated settings. Improving the consistency of ChatGPT would further increase reliability.",Rogasch JMM; Metzger G; Preisler M; Galler M; Thiele F; Brenner W; Feldhaus F; Wetz C; Amthauer H; Furth C; Schatka I 38736088,Pancytopenia Due to Folate Deficiency.,2024,The Journal of the Association of Physicians of India,,,,"We found the article on ""The Digital Technology in Clinical Medicine: From Calculators to ChatGPT"" interesting.(1) According to Kulkarni et al., humanity has witnessed four important social system changes, starting with the primitive hunter-gatherers and progressing to horticultural, agricultural, industrial, and the current fifth, which is based on digital information technology and has altered the way we present, recognize, and utilize different factors of production. In clinical medicine, digital technology has advanced significantly since the days of computations. According to Kulkarni et al., we should benefit from these advancements as we improve the lives of our patients while being cautious not to overturn the doctor-patient relationship. If technology, clinical expertise, and humanistic values are properly balanced, Kulkarni et al. 
concluded that the future is quite glorious.(1) Regulatory organizations are pushing for improvements through clinical trials as a result of recognition of the expanding influence of digital technology in healthcare delivery. The ""World Health Organization's Guidelines for Digital Interventions"" and the ""Food and Drug Administration's Digital Health Center of Excellence"" are only two of the projects that are currently being highlighted in the study as efforts to analyze and implement digital health services.",Anandan S; Soman S; Kumar JP; Shajee DS 38736087,Digital Technology in Clinical Medicine: Correspondence.,2024,The Journal of the Association of Physicians of India,,,,"We found the article on ""The Digital Technology in Clinical Medicine: From Calculators to ChatGPT"" interesting.(1) According to Kulkarni et al., humanity has witnessed four important social system changes, starting with the primitive hunter-gatherers and progressing to horticultural, agricultural, industrial, and the current fifth, which is based on digital information technology and has altered the way we present, recognize, and utilize different factors of production. In clinical medicine, digital technology has advanced significantly since the days of computations. According to Kulkarni et al., we should benefit from these advancements as we improve the lives of our patients while being cautious not to overturn the doctor-patient relationship. If technology, clinical expertise, and humanistic values are properly balanced, Kulkarni et al. concluded that the future is quite glorious.(1) Regulatory organizations are pushing for improvements through clinical trials as a result of recognition of the expanding influence of digital technology in healthcare delivery. 
The ""World Health Organizations Guidelines for Digital Interventions"" and the ""Food and Drug Administration's Digital Health Center of Excellence"" are only two of the projects that are currently being highlighted in the study as efforts to analyze and implement digital health services.",Kleebayoon A; Wiwanitkit V 38913203,Exploring the Unknown: Evaluating ChatGPT's Performance in Uncovering Novel Aspects of Plastic Surgery and Identifying Areas for Future Innovation.,2024,Aesthetic plastic surgery,,,,"We have perused with keen interest the scholarly article titled ""Exploring the Unknown: Evaluating ChatGPT's Performance in Uncovering Novel Aspects of Plastic Surgery and Identifying Areas for Future Innovation"" penned by Lim et al. in the esteemed journal ""Aesthetic Plastic Surgery"". This paper evaluates ChatGPT's potential application in plastic surgery, exploring its responses on various themes including pioneers, advancements, and techniques, as well as flap grafting. While it offers valuable insights, questions arise. Firstly, ChatGPT's attribution of plastic surgery's progenitor to Sir Harold Delf Gillies raises concerns of underlying biases in its responses. Secondly, its assertion on paramount contributions prompts reflection on its discernment criteria. Can targeted training enhance its accuracy? Lastly, the discourse questions biases favoring reconstructive over cosmetic procedures. ChatGPT's responses, while proficient in addressing medical queries, face ongoing veracity challenges, necessitating clinician scrutiny. However, this scrutiny may surpass user expertise, requiring additional measures for accuracy assurance. Artificial Intelligence (AI) replicates human cognitive abilities but lacks human richness, potentially affecting transformative insights in plastic surgery's future trajectory. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. 
For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .",Qi W; Niu F 39882908,"Ejtm3 experiences after ChatGPT and other AI approaches: values, risks, countermeasures.",2025,European journal of translational myology,,,,"We invariably hear that Artificial Intelligence (AI), a rapidly evolving technology, does not just creatively assemble known knowledge. We are told that AI learns, processes and creates, starting from fixed points to arrive at innovative solutions. In the case of scientific work, AI can generate data without ever having entered a laboratory, (i.e., blatantly plagiarizing the existing literature, a despicable old trick). How does an editor of a scientific journal recognize when she or he is faced with something like this? The solution is for editors and referees to rigorously evaluate the track records of submitting authors and what they are doing. For example, false color evaluations of 2D and 3D CT and MRI images have been used to validate functional electrical stimulation for degenerated denervated muscle and a home Full-Body In-Bed Gym program. These have been recently published in Ejtm and other journals. The editors and referees of Ejtm can exclude the possibility that the images were invented by ChatGPT. Why? Because they know the researchers: Marco Quadrelli, Aldo Morra, Daniele Coraci, Paolo Gargiulo and their collaborators as well! Artificial intelligence is not banned by the EJTM, but when submitting their manuscripts to previous and to a new Thematic Section dedicated to Generative AI in Translational Mobility Medicine authors must openly declare whether they have used artificial intelligence, of what type and for what purposes. 
This will not avoid risks of plagiarism or worse, but it will better establish possible liabilities.",Fano-Illic G; Coraci D; Maccarone MC; Masiero S; Quadrelli M; Morra A; Ravara B; Pond A; Forni R; Gargiulo P 37065364,Extreme Hyperthermia Due to Methamphetamine Toxicity Presenting As ST-Elevation Myocardial Infarction on EKG: A Case Report Written With ChatGPT Assistance.,2023,Cureus,,,,"We present a case report of a 37-year-old male who presented to the emergency department with altered mental status and electrocardiographic changes suggestive of an ST-elevation myocardial infarction (STEMI). He was ultimately diagnosed with extreme hyperthermia, secondary to drug use, which was managed promptly with supportive measures resulting in a successful outcome. This case highlights the importance of considering drug-induced hyperthermia as a potential cause of altered mental status and EKG changes in patients, especially in those with a history of drug abuse.",Schussler JM; Tomson C; Dresselhouse MP 38222391,From Free-text Drug Labels to Structured Medication Terminology with BERT and GPT.,2023,AMIA ... Annual Symposium proceedings. AMIA Symposium,,,,"We present a method to enrich controlled medication terminology from free-text drug labels. This is important because, while controlled medication terminology capture well-structured medication information, much of the information pertaining to medications is still found in free-text. First, we compared different Named Entity Recognition (NER) models including rule-based, feature-based, deep learning-based models with Transformers as well as ChatGPT, few-shot and fine-tuned GPT-3 to find the most suitable model that accurately extracts medication entities (ingredients, brand, dose, etc.) from free-text. Then, a rule-based Relation Extraction algorithm transforms NER results into a well-structured medication knowledge graph. 
Finally, a Medication Searching method takes the knowledge graph and matches it to relevant medications in the terminology server. An empirical evaluation on real-world drug labels shows that BERT-CRF was the most effective NER model, with an F-measure of 95%. After performing term normalization, the Medication Searching achieved an accuracy of 77% when matching a label to the relevant medication in the terminology server. The NER and Medication Searching models could be deployed as a web service capable of accepting free-text queries and returning structured medication information; thus providing a useful means of better managing medication information found in different health systems.",Ngo DH; Koopman B 40294189,nan,2025,nan,,,,"WHAT IS THE 2025 WATCH LIST? * The Watch List is an annual Horizon Scan report from Canada's Drug Agency that presents emerging technologies and issues that have the potential to shape the future of health care in Canada. * The 2025 Watch List focuses on the use of artificial intelligence (AI) technologies in health care and the issues that may arise with the implementation of these technologies. * AI technologies have the potential to significantly transform health care systems. These technologies could increase efficiency by reducing administrative burden, improve patient outcomes, and enhance patient experience by creating more access points to the health care system. However, there are also legal, ethical, environmental, and social implications with the rollout of these technologies. WHY IS THIS AN ISSUE? * Substantial public and private investments are being made in AI technologies for health care. AI technologies are already being implemented in some parts of the Canadian health care system. Commercial options, such as ChatGPT, allow AI technologies to be used by patients to assist with their health care journeys. 
Because they are readily available and easy to use, these same tools are sometimes used by clinicians and, in some cases, without sanction or training from employers or regulators. * AI health care technologies also present an opportunity to fundamentally change health care by their ability to replace, displace, or augment tasks that have traditionally required human cognition. The potential health human resources impact of machines taking on some of this load is significant given the increasing demand for health care services and the finite capacity of health care systems in Canada. WHAT IS THE POTENTIAL IMPACT? * The Watch List signals which technologies are poised to make an impact and the policies, regulatory or organizational enablers, and/or guardrails that are needed to optimize the proliferation of these technologies in the health care system. * The 2025 Watch List also focuses on considerations for optimizing and accelerating implementation, such as the massive potential impact on operations, clinical outcomes, and staff and patient experience, while minimizing risks. WHAT ELSE DO WE NEED TO KNOW? * The 2025 Watch List of AI technologies and issues in health care was developed through consensus-based decision-making at a workshop in November 2024 including individuals from across Canada with experience and expertise in AI. * The 2025 Watch List identifies and describes the top 5 new and emerging AI technologies in health care. Examples include AI for notetaking and AI for disease detection and diagnosis. We also explore some considerations for health care decision-makers about the impact these technologies may have on health human resources, health care infrastructure, and health equity. * The 2025 Watch List also identifies the top 5 issues related to AI technologies in health care. 
Examples include the importance of establishing guidelines around what data are used to train AI algorithms and how that might contribute to bias as well as considerations about the liability and accountability of health care providers and systems that use these technologies. These are key issues that warrant more attention and will influence the wider adoption, diffusion, and implementation of new and emerging AI technologies. * Monitoring ongoing developments and evidence related to the top technologies and issues highlighted in the 2025 Watch List can help guide health system planning in Canada and improve access to high-quality care.",nan 37917126,The Impact of Multimodal Large Language Models on Health Care's Future.,2023,Journal of medical Internet research,,,,"When large language models (LLMs) were introduced to the public at large in late 2022 with ChatGPT (OpenAI), the interest was unprecedented, with more than 1 billion unique users within 90 days. Until the introduction of Generative Pre-trained Transformer 4 (GPT-4) in March 2023, these LLMs only contained a single mode-text. As medicine is a multimodal discipline, the potential future versions of LLMs that can handle multimodality-meaning that they could interpret and generate not only text but also images, videos, sound, and even comprehensive documents-can be conceptualized as a significant evolution in the field of artificial intelligence (AI). This paper zooms in on the new potential of generative AI, a new form of AI that also includes tools such as LLMs, through the achievement of multimodal inputs of text, images, and speech on health care's future. We present several futuristic scenarios to illustrate the potential path forward as multimodal LLMs (M-LLMs) could represent the gateway between health care professionals and using AI for medical purposes. 
It is important to point out, though, that despite the unprecedented potential of generative AI in the form of M-LLMs, the human touch in medicine remains irreplaceable. AI should be seen as a tool that can augment health care professionals rather than replace them. It is also important to consider the human aspects of health care-empathy, understanding, and the doctor-patient relationship-when deploying AI.",Mesko B 37410672,ChatGPT: the threats to medical education.,2023,Postgraduate medical journal,,,,"While it offers abundant advantages, ChatGPT threatens to significantly harm the educational attainment, and the intellectual life, of students of medicine and the subjects that complement it. This technology poses a serious threat to the ability of such students to deliver safe and effective medical care once they graduate to clinical practice. Institutions that provide medical education must react to the existence, availability, and rapidly increasing competency of GPT models. This article suggests an intervention by which this could be, at least partially, achieved.",Armitage RC 36811129,Artificial Hallucinations in ChatGPT: Implications in Scientific Writing.,2023,Cureus,,,,"While still in its infancy, ChatGPT (Generative Pretrained Transformer), introduced in November 2022, is bound to hugely impact many industries, including healthcare, medical education, biomedical research, and scientific writing. The implications of ChatGPT, the new chatbot introduced by OpenAI, for academic writing are largely unknown. In response to the Journal of Medical Science (Cureus) Turing Test - call for case reports written with the assistance of ChatGPT, we present two cases: one of homocystinuria-associated osteoporosis, and the other of late-onset Pompe disease (LOPD), a rare metabolic disorder. We tested ChatGPT to write about the pathogenesis of these conditions. 
We documented the positive, negative, and rather troubling aspects of our newly introduced chatbot's performance.",Alkaissi H; McFarlane SI 39096130,Appropriateness of ChatGPT as a resource for medication-related questions.,2024,British journal of clinical pharmacology,,,,"With its increasing popularity, healthcare professionals and patients may use ChatGPT to obtain medication-related information. This study was conducted to assess ChatGPT's ability to provide satisfactory responses (i.e., directly answers the question, accurate, complete and relevant) to medication-related questions posed to an academic drug information service. ChatGPT responses were compared to responses generated by the investigators through the use of traditional resources, and references were evaluated. Thirty-nine questions were entered into ChatGPT; the three most common categories were therapeutics (8; 21%), compounding/formulation (6; 15%) and dosage (5; 13%). Ten (26%) questions were answered satisfactorily by ChatGPT. Of the 29 (74%) questions that were not answered satisfactorily, deficiencies included lack of a direct response (11; 38%), lack of accuracy (11; 38%) and/or lack of completeness (12; 41%). References were included with eight (29%) responses; each included fabricated references. Presently, healthcare professionals and consumers should be cautioned against using ChatGPT for medication-related information.",Grossman S; Zerilli T; Nathan JP 38836986,"Assessing ChatGPT's Potential in HIV Prevention Communication: A Comprehensive Evaluation of Accuracy, Completeness, and Inclusivity.",2024,AIDS and behavior,,,,"With the advancement of artificial intelligence(AI), platforms like ChatGPT have gained traction in different fields, including Medicine. This study aims to evaluate the potential of ChatGPT in addressing questions related to HIV prevention and to assess its accuracy, completeness, and inclusivity. 
A team consisting of 15 physicians, six members from HIV communities, and three experts in gender and queer studies designed an assessment of ChatGPT. Queries were categorized into five thematic groups: general HIV information, behaviors increasing HIV acquisition risk, HIV and pregnancy, HIV testing, and prophylaxis use. A team of medical doctors was in charge of developing questions to be submitted to ChatGPT. The other members critically assessed the generated responses regarding level of expertise, accuracy, completeness, and inclusivity. The median accuracy score was 5.5 out of 6, with 88.4% of responses achieving a score ≥ 5. Completeness had a median of 3 out of 3, while the median for inclusivity was 2 out of 3. Some thematic groups, like behaviors associated with HIV transmission and prophylaxis, exhibited higher accuracy, indicating variable performance across different topics. Issues of inclusivity were identified, notably the use of outdated terms and a lack of representation for some communities. ChatGPT demonstrates significant potential in providing accurate information on HIV-related topics. However, while responses were often scientifically accurate, they sometimes lacked the socio-political context and inclusivity essential for effective health communication. 
This underlines the importance of aligning AI-driven platforms with contemporary health communication strategies and ensuring the balance of accuracy and inclusivity.",De Vito A; Colpani A; Moi G; Babudieri S; Calcagno A; Calvino V; Ceccarelli M; Colpani G; d'Ettorre G; Di Biagio A; Farinella M; Falaguasta M; Foca E; Giupponi G; Habed AJ; Isenia WJ; Lo Caputo S; Marchetti G; Modesti L; Mussini C; Nunnari G; Rusconi S; Russo D; Saracino A; Serra PA; Madeddu G 40084442,[Can large language models answer clinical questions?].,2025,Recenti progressi in medicina,,,,"With the advancement of large language models (LLMs) such as ChatGPT, their application in medicine is growing, but it is crucial that the responses are aligned with international guidelines. Recent studies have shown that LLMs can be useful in the medical field, providing correct answers to questions about the management and treatment of specific diseases. However, the accuracy of these models must also include readability and thoroughness of the answers and consistency with guidelines. In addition to these characteristics, relevance, pertinence, and up-to-date nature of the sources used by the LLMs to answer questions must be ensured. Furthermore, studies are needed to investigate the consistency of responses across different LLMs and languages used by them, as well as training processes that ensure greater reliability, especially when dealing with rare or complex diseases. Although LLMs can support medical education and decision-making, their integration into clinical practice requires further validation and comparison with international guidelines.",Santoro E 38315648,Harnessing the open access version of ChatGPT for enhanced clinical opinions.,2024,PLOS digital health,,,,"With the advent of Large Language Models (LLMs) like ChatGPT, the integration of Generative Artificial Intelligence (GAI) into clinical medicine is becoming increasingly feasible. 
This study aimed to evaluate the ability of the freely available ChatGPT-3.5 to generate complex differential diagnoses, comparing its output to case records of the Massachusetts General Hospital published in the New England Journal of Medicine (NEJM). Forty case records were presented to ChatGPT-3.5, prompting it to provide a differential diagnosis and then narrow it down to the most likely diagnosis. The results indicated that the final diagnosis was included in ChatGPT-3.5's original differential list in 42.5% of the cases. After narrowing, ChatGPT correctly determined the final diagnosis in 27.5% of the cases, demonstrating a decrease in accuracy compared to previous studies using common chief complaints. These findings emphasize the necessity for further investigation into the capabilities and limitations of LLMs in clinical scenarios while highlighting the potential role of GAI as an augmented clinical opinion. Anticipating the growth and enhancement of GAI tools like ChatGPT, physicians and other healthcare workers will likely find increasing support in generating differential diagnoses. However, continued exploration and regulation are essential to ensure the safe and effective integration of GAI into healthcare practice. Future studies may seek to compare newer versions of ChatGPT or investigate patient outcomes with physicians integrating this GAI technology. Understanding and expanding GAI's capabilities, particularly in differential diagnosis, may foster innovation and provide additional resources, especially in underserved areas in the medical field.",Tenner ZM; Cottone MC; Chavez MR 37789676,"Large language models in radiology: fundamentals, applications, ethical considerations, risks, and future directions.",2024,"Diagnostic and interventional radiology (Ankara, Turkey)",,,,"With the advent of large language models (LLMs), the artificial intelligence revolution in medicine and radiology is now more tangible than ever. 
Every day, an increasingly large number of articles are published that utilize LLMs in radiology. To adopt and safely implement this new technology in the field, radiologists should be familiar with its key concepts, understand at least the technical basics, and be aware of the potential risks and ethical considerations that come with it. In this review article, the authors provide an overview of the LLMs that might be relevant to the radiology community and include a brief discussion of their short history, technical basics, ChatGPT, prompt engineering, potential applications in medicine and radiology, advantages, disadvantages and risks, ethical and regulatory considerations, and future directions.",Akinci D'Antonoli T; Stanzione A; Bluethgen C; Vernuccio F; Ugga L; Klontzas ME; Cuocolo R; Cannella R; Kocak B 39563551,[Artificial intelligence and large language models: challenges and prospects in research and medicine].,2024,"Urologiia (Moscow, Russia : 1999)",,,,"With the development and spread of artificial intelligence, technologies based on the neural networks (for example, large language models) have attracted the most attention as promising methods for analyzing and processing data in various fields. Large language models (LLMs) are systems trained on huge amounts of text data and capable of generating answers to user queries. Examples of well-known LLMs are ChatGPT, Bing, Sparrow, BlenderBot, Bard, YandexGPT, GigaChat and others. Currently, artificial intelligence (AI) plays an important role in scientific and research work, including processing of medical data, making diagnoses, drafting scientific papers and documentation, writing articles, reviews and other academic materials. The evolution and use of large language models in various fields of medicine (and beyond) is presented in the article. 
In addition, the prospects for their future use, obstacles that hinder their active implementation and the importance of monitoring their use are analyzed.",Taratkin M S; Shchelkunova K Y; Azilgareeva C R; Ali S K; Morozov A O; Salpagarova A I; Gadzhieva Z K; Gazimiev M A 39013770,[Not Available].,2024,Journal international de bioethique et d'ethique des sciences,,,,"With the emergence of innovations and technological advancements – exemplified by telemedicine and more recently by the extremely rapid development of Generative Artificial Intelligence systems (Gen AI) like Large Language Models (LLM) such as ChatGPT – we are witnessing a progressive transformation of ancient Hippocratic medicine and the physician-patient relationship. These healthcare Gen AI, which carry inherently stakes, risks, opportunities, hopes, and concerns, justify an increased and reinforced ethical vigilance. These digital applications, multiplying, diversifying, and improving their performance day by day, tend to reshape the role and practices of healthcare professionals in their analytical and decision-making medical process.The digitalization of medicine will inevitably encourage physicians to undergo specific training and acquire new competences in new technologies so that they can learn to navigate better in this new ecosystem of knowledge and practices while preserving their medical expertise, skills, and critical thinking. Therefore, faced with the inevitable arrival of Gen AI in medicine, this article aims to question how we can approach and use this technological tool within an evolving ethical framework to maintain a quality healthcare service for users.",Monteil C; Beranger J 38977771,The ethics of ChatGPT in medicine and healthcare: a systematic review on Large Language Models (LLMs).,2024,NPJ digital medicine,,,,"With the introduction of ChatGPT, Large Language Models (LLMs) have received enormous attention in healthcare. 
Despite potential benefits, researchers have underscored various ethical implications. While individual instances have garnered attention, a systematic and comprehensive overview of practical applications currently researched and ethical issues connected to them is lacking. Against this background, this work maps the ethical landscape surrounding the current deployment of LLMs in medicine and healthcare through a systematic review. Electronic databases and preprint servers were queried using a comprehensive search strategy which generated 796 records. Studies were screened and extracted following a modified rapid review approach. Methodological quality was assessed using a hybrid approach. For 53 records, a meta-aggregative synthesis was performed. Four general fields of applications emerged showcasing a dynamic exploration phase. Advantages of using LLMs are attributed to their capacity in data analysis, information provisioning, support in decision-making or mitigating information loss and enhancing information accessibility. However, our study also identifies recurrent ethical concerns connected to fairness, bias, non-maleficence, transparency, and privacy. A distinctive concern is the tendency to produce harmful or convincing but inaccurate content. Calls for ethical guidance and human oversight are recurrent. We suggest that the ethical guidance debate should be reframed to focus on defining what constitutes acceptable human oversight across the spectrum of applications. This involves considering the diversity of settings, varying potentials for harm, and different acceptable thresholds for performance and certainty in healthcare. 
Additionally, critical inquiry is needed to evaluate the necessity and justification of LLMs' current experimental use.",Haltaufderheide J; Ranisch R 38917600,Pre-trained language models in medicine: A survey.,2024,Artificial intelligence in medicine,,,,"With the rapid progress in Natural Language Processing (NLP), Pre-trained Language Models (PLM) such as BERT, BioBERT, and ChatGPT have shown great potential in various medical NLP tasks. This paper surveys the cutting-edge achievements in applying PLMs to various medical NLP tasks. Specifically, we first brief PLMS and outline the research of PLMs in medicine. Next, we categorise and discuss the types of tasks in medical NLP, covering text summarisation, question-answering, machine translation, sentiment analysis, named entity recognition, information extraction, medical education, relation extraction, and text mining. For each type of task, we first provide an overview of the basic concepts, the main methodologies, the advantages of applying PLMs, the basic steps of applying PLMs application, the datasets for training and testing, and the metrics for task evaluation. Subsequently, a summary of recent important research findings is presented, analysing their motivations, strengths vs weaknesses, similarities vs differences, and discussing potential limitations. Also, we assess the quality and influence of the research reviewed in this paper by comparing the citation count of the papers reviewed and the reputation and impact of the conferences and journals where they are published. Through these indicators, we further identify the most concerned research topics currently. Finally, we look forward to future research directions, including enhancing models' reliability, explainability, and fairness, to promote the application of PLMs in clinical practice. 
In addition, this survey also collects download links for the model code and relevant datasets, which are valuable references for researchers applying NLP techniques in medicine and for medical professionals seeking to enhance their expertise and healthcare service through AI technology.",Luo X; Deng Z; Yang B; Luo MY 37881016,Artificial intelligence and increasing misinformation.,2024,The British journal of psychiatry : the journal of mental science,,,,"With the recent advances in artificial intelligence (AI), patients are increasingly exposed to misleading medical information. Generative AI models, including large language models such as ChatGPT, create and modify text, images, audio and video information based on training data. Commercial use of generative AI is expanding rapidly and the public will routinely receive messages created by generative AI. However, generative AI models may be unreliable, routinely make errors and widely spread misinformation. Misinformation created by generative AI about mental illness may include factual errors, nonsense, fabricated sources and dangerous advice. Psychiatrists need to recognise that patients may receive misinformation online, including about medicine and psychiatry.",Monteith S; Glenn T; Geddes JR; Whybrow PC; Achtyes E; Bauer M 38366043,The Breakthrough of Large Language Models Release for Medical Applications: 1-Year Timeline and Perspectives.,2024,Journal of medical systems,,,,"Within the domain of Natural Language Processing (NLP), Large Language Models (LLMs) represent sophisticated models engineered to comprehend, generate, and manipulate text resembling human language on an extensive scale. They are transformer-based deep learning architectures, obtained through the scaling of model size, pretraining of corpora, and computational resources. 
The potential healthcare applications of these models primarily involve chatbots and interaction systems for clinical documentation management, and medical literature summarization (Biomedical NLP). The challenge in this field lies in the research for applications in diagnostic and clinical decision support, as well as patient triage. Therefore, LLMs can be used for multiple tasks within patient care, research, and education. Throughout 2023, there has been an escalation in the release of LLMs, some of which are applicable in the healthcare domain. This remarkable output is largely the effect of the customization of pre-trained models for applications like chatbots, virtual assistants, or any system requiring human-like conversational engagement. As healthcare professionals, we recognize the imperative to stay at the forefront of knowledge. However, keeping abreast of the rapid evolution of this technology is practically unattainable, and, above all, understanding its potential applications and limitations remains a subject of ongoing debate. Consequently, this article aims to provide a succinct overview of the recently released LLMs, emphasizing their potential use in the field of medicine. Perspectives for a more extensive range of safe and effective applications are also discussed. The upcoming evolutionary leap involves the transition from an AI-powered model primarily designed for answering medical questions to a more versatile and practical tool for healthcare providers such as generalist biomedical AI systems for multimodal-based calibrated decision-making processes. 
On the other hand, the development of more accurate virtual clinical partners could enhance patient engagement, offering personalized support and improving chronic disease management.",Cascella M; Semeraro F; Montomoli J; Bellini V; Piazza O; Bignami E 40172683,"Hyper-DREAM, a Multimodal Digital Transformation Hypertension Management Platform Integrating Large Language Model and Digital Phenotyping: Multicenter Development and Initial Validation Study.",2025,Journal of medical systems,,,,"Within the mHealth framework, systematic research that collects and analyzes patient data to establish comprehensive digital health archives for hypertensive patients, and leverages large language models (LLMs) to assist clinicians in health management and Blood Pressure (BP) control remains limited. In this study, we aim to describe the design, development, and usability evaluation process of a management platform (Hyper-DREAM) for hypertension. Our multidisciplinary team employed an iterative design approach over the course of a year to develop the Hyper-DREAM platform. This platform's primary functionalities encompass multimodal data collection (personal hypertensive digital phenotype archive), multimodal interventions (BP measurement, medication assistance, behavior modification, and hypertension education) and multimodal interactions (clinician-patient engagement and BP Coach component). In August 2024, the mHealth App Usability Questionnaire (MAUQ) was conducted involving 51 hypertensive patients recruited from three distinct centers. In parallel, six clinicians engaged in management activities and contributed feedback via the Doctor's Software Satisfaction Questionnaire (DSSQ). Concurrently, a real-world comparative experiment was conducted to evaluate the usability of the BP Coach, ChatGPT-4o Mini, ChatGPT-4o and clinicians. 
The comparative experiment demonstrated that the BP Coach achieved significantly higher scores in utility (mean scores 4.05, SD 0.87) and completeness (mean scores 4.12, SD 0.78) when compared to ChatGPT-4o Mini, ChatGPT-4o, and clinicians. In terms of clarity, the BP Coach was slightly lower than clinicians (mean scores 4.03, SD 0.88). In addition, the BP Coach exhibited lower performance in conciseness (mean scores 3.00, SD 0.96). Clinicians reported a marked improvement in work efficiency (2.67 vs. 4.17, P < .001) and experienced faster and more effective patient interactions (3.0 vs. 4.17, P = .004). Furthermore, the Hyper-DREAM platform significantly decreased work intensity (2.5 vs. 3.5, P = .01) and minimized disruptions to daily routines (2.33 vs. 3.55, P = .004). The Hyper-DREAM platform demonstrated significantly greater overall satisfaction compared to the WeChat-based standard management (3.33 vs. 4.17, P = .01). Additionally, clinicians exhibited a markedly higher willingness to integrate the Hyper-DREAM platform into clinical practice (2.67 vs. 4.17, P < .001). Furthermore, patient management time decreased from 11.5 min (SD 1.87) with WeChat-based standard management to 7.5 min (SD 1.84, P = .01) with Hyper-DREAM. Hypertensive patients reported high satisfaction with the Hyper-DREAM platform, including ease of use (mean scores 1.60, SD 0.69), system information arrangement (mean scores 1.69, SD 0.71), and usefulness (mean scores 1.57, SD 0.58). In conclusion, our study presents Hyper-DREAM, a novel artificial intelligence-driven platform for hypertension management, designed to alleviate clinician workload and exhibiting significant promise for clinical application. The Hyper-DREAM platform is distinguished by its user-friendliness, high satisfaction rates, utility, and effective organization of information. 
Furthermore, the BP Coach component underscores the potential of LLMs in advancing mHealth approaches to hypertension management.",Wang Y; Zhu T; Zhou T; Wu B; Tan W; Ma K; Yao Z; Wang J; Li S; Qin F; Xu Y; Tan L; Liu J; Wang J