Using ChatGPT-4 for Lay Summarization in Prostate Cancer Research to Advance Patient-Centered Communication: Large-Scale Generative AI Performance Evaluation

Emily Rinderknecht*, Simon U. Engelmann, Veronika Saberi, Clemens Kirschner, Anton P. Kravchuk, Anna Schmelzer, Johannes Breyer, Christopher Goßler, Roman Mayr, Christian Gilfrich, Maximilian Burger, Dominik von Winning, Hendrik Borgmann, Christian Wülfing, Axel S. Merseburger, Maximilian Haas, Matthias May

*Corresponding author for this work

Abstract

Background: The increasing volume and complexity of biomedical literature pose challenges for making scientific knowledge accessible to lay audiences. Lay summaries, now widely encouraged or required by journals, aim to bridge this gap by promoting health literacy, patient engagement, and public trust. However, many are written by scientists without formal training in plain-language communication, often resulting in limited clarity, readability, and consistency. Generative large language models such as ChatGPT-4 offer a scalable opportunity to support lay summary creation, though their effectiveness within specific clinical domains has not been systematically evaluated at scale.

Objective: This study aimed to assess ChatGPT-4's performance in generating lay summaries for prostate cancer studies. A secondary objective was to evaluate how prompt design influences summary quality, aiming to provide practical guidance for the use of generative artificial intelligence (AI) in scientific publishing.

Methods: A total of 204 consecutive articles on prostate cancer were extracted from a high-ranking oncology journal mandating lay summaries. Each abstract was processed with ChatGPT-4 using 2 prompts: a simple prompt based on the journal's guidelines and an extended prompt refined to improve readability. AI-generated and original summaries were evaluated using 3 criteria: readability (Flesch-Kincaid Reading Ease [FKRE]), factual accuracy (5-point Likert scale, blinded rating by 2 clinical experts), and compliance with word count instructions (120-150 words). Summaries were classified as high-quality as a composite outcome if they met all 3 benchmarks: FKRE >30, accuracy ≥4 from both raters, and word count within range. Statistical comparisons used Wilcoxon signed-rank and paired 2-tailed t tests (P<.05).
Results: ChatGPT-4-generated lay summaries showed an improvement in readability compared to human-written versions, with the extended prompt achieving higher scores than the simple prompt (median FKRE: extended prompt 47, IQR 42-56; simple prompt 36, IQR 29-43; original 20, IQR 9.5-29; P<.001).

Factual accuracy ratings also favored the AI-generated summaries over the original versions (original median 5, IQR 4-5; P<.001) in this dataset. Compliance with the word count instruction was greater for both AI-generated summary types than for the originals (noncompliant word counts: extended prompt 39/204, 19%; simple prompt 40/204, 20%; original 140/204, 69%; P<.001). Between the simple and extended prompts, there were no significant differences in accuracy (P=.53) or word count compliance (P=.87). The proportion of summaries rated high-quality was 79.4% for the extended prompt, 54.9% for the simple prompt, and 5.4% for the original summaries (P<.001).

Conclusions: With optimized prompting, ChatGPT-4 produced lay summaries that, on average, scored higher than author-written versions in readability, factual accuracy, and structural compliance within our dataset. These results support integrating generative AI into editorial workflows to improve science communication for nonexpert audiences. Limitations include the focus on a single clinical domain and journal and the absence of layperson evaluation.
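The composite high-quality benchmark described in the Methods can be sketched as a simple conjunctive check. This is a minimal illustration only: the thresholds (FKRE >30, accuracy ≥4 from both raters, 120-150 words) come from the abstract, while the function name and interface are assumptions.

```python
# Sketch of the study's composite "high-quality" classification.
# Thresholds are taken from the abstract; the interface is illustrative.

def is_high_quality(fkre: float, rater_scores: list[int], word_count: int) -> bool:
    """A summary counts as high-quality only if it meets all 3 benchmarks."""
    readable = fkre > 30                            # FKRE above 30
    accurate = all(s >= 4 for s in rater_scores)    # >=4 from both blinded raters
    compliant = 120 <= word_count <= 150            # within instructed word range
    return readable and accurate and compliant
```

Because all three conditions must hold at once, a summary can fail on any single criterion, which is consistent with the large gap between the readability-driven FKRE results and the much smaller proportion of original summaries meeting the composite outcome.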

Original language: English
Article number: e76598
Journal: Journal of Medical Internet Research
Volume: 27
Publication status: Published - 2025

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Research Areas and Centers

  • Research Area: Luebeck Integrated Oncology Network (LION)
  • Centers: Center for Artificial Intelligence Luebeck (ZKIL)

DFG Research Classification Scheme

  • 2.22-23 Reproductive Medicine, Urology
  • 4.43-04 Artificial Intelligence and Machine Learning Methods

KDSF Research Field Classification Scheme

  • 073 - Artificial intelligence and big data
