A paper titled “Introducing Answered with Evidence – A framework for evaluating whether large language model (LLM) responses to biomedical questions are founded in evidence” was published on medRxiv by authors affiliated with Atropos Health as an update to a prior paper of the same title.
Short summary:
The growing use of large language models (LLMs) for biomedical question answering raises concerns about the accuracy and evidentiary support of their responses. To address this, we present Answered with Evidence, a framework for evaluating whether LLM-generated answers are grounded in the scientific literature. We analyzed thousands of physician-submitted questions using a comparative pipeline across seven LLMs grounded in different evidence sources. Six sources were grounded in PubMed or general online content; the seventh was grounded in the Atropos Evidence™ Library, called Alexandria®, a collection of custom real-world analyses. We found that the general-purpose LLMs grounded in public information varied greatly in the answers they returned, even when those answers were sourced from the same publication. Using an ensemble approach, we observed that 49% of the time, two or more LLMs agreed on an answer. Combined, the ensemble approach and the Alexandria custom-built source enabled reliable answers to over 64% of biomedical queries. As LLMs become increasingly capable of summarizing scientific content, maximizing their value will require systems that can accurately retrieve both published and custom-generated evidence or generate reliable evidence in real time.
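The ensemble criterion the summary describes, accepting an answer only when two or more LLMs agree, can be sketched roughly as below. This is a minimal illustration under our own assumptions (the function name and the premise that each model's answer has been normalized to a comparable string are ours, not from the paper):

```python
from collections import Counter

def ensemble_answer(answers):
    """Return an answer only when two or more models agree on it.

    `answers` is a hypothetical list of normalized answer strings,
    one per LLM; None marks a model that returned no usable answer.
    Returns the majority answer, or None when no answer reaches
    at least two votes.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= 2 else None
```

For example, `ensemble_answer(["yes", "yes", "no"])` yields `"yes"`, while seven models each giving a different answer would yield no consensus.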
