Short Summary:
Large Language Models (LLMs) have been evaluated extensively for general summarization tasks and for medical research assistance, but not for the specific task of summarizing real-world evidence (RWE) from the structured outputs of RWE studies. This paper introduces RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025), to enable benchmarking of LLMs for this task.
The paper uses RWESummary to compare the performance of different LLMs within Atropos Health’s internal RWE summarization tool. The results suggest that RWESummary is a novel and useful foundation model benchmark for summarizing real-world evidence studies.
Discussion:
The applications and use of RWE and LLMs are evolving rapidly. While no single model is likely to dominate, in the short to mid term it is critical to make pragmatic choices about how to automate narrative RWE summarization, so that organizations can take full advantage of the other technologies that facilitate generating and provisioning RWE at the bedside and for policy decisions. The benchmark introduced in this paper supports exactly that: it allows LLM model selection to be grounded in data rather than guesswork.
To learn how Atropos Health can accelerate and supplement your research with real-world evidence (RWE), contact sales@atroposhealth.com.