Evaluating the Reporting Quality of 21,041 Randomized Controlled Trial Articles with Large Language Models: A Large-Scale Transparency Analysis

Srinivasan A, Kivelson S, Friedrich N, Berkowitz J, Tatonetti N (2025). Evaluating the Reporting Quality of 21,041 Randomized Controlled Trial Articles with Large Language Models: A Large-Scale Transparency Analysis. Lecture Notes in Computer Science, 428-437. https://doi.org/10.1007/978-3-031-95838-0_42

Overall rating
(4.0) 1 review
Authors: Apoorva Srinivasan, Sophia Kivelson, Nadine A. Friedrich, Jacob Berkowitz, Nicholas Tatonetti
Journal: Lecture Notes in Computer Science
First published: 2025
Type: Book Chapter
ISBN: 9783031958373, 9783031958380

Reviews

Informative Title

100%
Appropriate
Slightly Misleading
Exaggerated

Methods

100%
Sound
Questionable
Inadequate

Statistical Analysis

100%
Appropriate
Some Issues
Major concerns

Data Presentation

100%
Complete and Transparent
Minor Omissions
Misrepresented

Discussion

100%
Appropriate
Slightly Misleading
Exaggerated

Limitations

100%
Appropriately acknowledged
Minor Omissions
Inadequate

Data Available

100%
Completely Available
Partial data available
Not Open Access

BurgundyPhMeter Jul 16, 2025

I found the study informative and have included it in a review I am writing. However, the limitation stated at the end is that the assessment was coarse: it checked whether CONSORT reporting items were included, without assessing the quality of that reporting. If this is the case, a casual reader skimming the paper could be greatly misled. The authors could have randomly sampled articles from the data and assessed reporting quality in that sample. Why was this not done? I think the numbers may be inflated. These high accuracy values could also be compared with those in a recent paper that used older models; it seems unlikely that moving from GPT-3.5 to GPT-4 solved all problems: Woelfle, T., Hirt, J., Janiaud, P., Kappos, L., Ioannidis, J. P. A., & Hemkens, L. G. (2024). Benchmarking Human–AI collaboration for common evidence appraisal tools. Journal of Clinical Epidemiology, 175, 111533. doi: 10.1016/j.jclinepi.2024.111533
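
The random-sampling check suggested in the review could look roughly like the sketch below. It is only an illustration, not the authors' method: the integer article IDs, the sample size of 100, and the function name sample_for_audit are all assumptions made here. The idea is to draw a reproducible simple random sample of articles that human raters would then assess for reporting quality.

```python
import random


def sample_for_audit(article_ids, k, seed=0):
    """Draw k distinct article IDs uniformly at random for manual review."""
    # A fixed seed makes the audit sample reproducible across runs.
    rng = random.Random(seed)
    return sorted(rng.sample(article_ids, k))


# Hypothetical integer IDs standing in for the 21,041 articles in the study.
article_ids = list(range(1, 21042))

# k=100 is an arbitrary illustrative sample size, not a recommendation.
audit_sample = sample_for_audit(article_ids, k=100)
print(len(audit_sample), "articles selected for manual quality assessment")
```

Sampling without replacement from the full set keeps the audit unbiased with respect to journal, year, or model output; a stratified design would be a reasonable alternative if some strata are of particular interest.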