Evidence syntheses require independent reviewers to extract data and assess the risk of bias (ROB)1, a labor-intensive and time-consuming process2, particularly in complementary and alternative medicine (CAM) research3,4. CAM has gained prominence due to its efficacy and safety profiles, leading to increased adoption by both clinicians and patients5,6. However, the lack of high-quality evidence necessitates efficient syntheses to support clinical practice7. CAM’s complex, discipline-specific terminology and multilingual literature further complicate data extraction8.
Recent advances in generative artificial intelligence have produced outstanding large language models (LLMs) capable of analyzing vast text corpora, capturing complex contexts, and adapting to specialized domains, making them suitable for evidence synthesis9. Preliminary studies suggest LLMs’ potential in systematic reviews and meta-analyses10,11,12,13. However, their application in CAM is limited due to difficulties in creating CAM-specific prompts, maintaining precision in terminology, and handling diverse languages and study designs14. Moreover, the potential for LLM-assisted methods, where AI and human expertise work in tandem, remains largely unexplored. This study aimed to develop structured prompts for guiding LLMs in extracting both basic and CAM-specific data and assessing ROB in randomized controlled trials (RCTs) on CAM interventions. We compared LLM-only and LLM-assisted methods to conventional approaches, seeking to enhance efficiency and quality in data extraction and ROB assessment, ultimately supporting clinical practice and guidelines.
We randomly selected 107 RCTs (Supplementary Table 1) from 12 Cochrane reviews15,16,17,18,19,20,21,22,23,24,25,26, spanning 1979–2024; 27.1% were published in English and 72.9% in Chinese, and 44.9% were published after 2013. Studies focused on mind-body practices (41.1%), herbal decoctions (34.6%), and natural products (24.3%). In terms of OCR recognizability, 94.4% of the RCTs had higher recognizability (≥70% of text and data accurately detected) and 5.6% had lower recognizability (<70%).
Two LLMs—Claude-3.5-sonnet and Moonshot-v1-128k—were employed to extract data and assess ROB. Supplementary Notes 1 to 4 document all responses from the two models, and Supplementary Tables 2 to 7 present the analysis results of LLM-only and LLM-assisted extractions and assessments.
From 107 RCTs, both models produced 12,814 extractions. As shown in Fig. 1, Claude-3.5-sonnet showed superior overall accuracy (96.2%, 95% CI: 95.8–96.5%) compared to Moonshot-v1-128k (95.1%, 95% CI: 94.7–95.5%), with a statistically significant difference (RD: 1.1%, 95% CI: 0.6–1.6%; p < 0.001). Claude-3.5-sonnet outperformed in the Baseline Characteristics domain, while both models had similar accuracy across other domains. For Moonshot-v1-128k, the highest correctness rate was in the Outcomes domain (97.6%), and the lowest was in the Methods domain (90.9%). Errors in Moonshot-v1-128k’s extractions often resulted from incorrectly labeling data as “Not reported.” Commonly missed information included start/end dates (44 RCTs), baseline balance descriptions (39), number analyzed (40), demographics (22), theoretical basis (21), treatment frequency (8), all outcomes (3), and specific outcome data (45). However, Moonshot-v1-128k successfully extracted CAM-specific data, such as traditional Chinese medicine terminology. The inter-model agreement rate between Claude-3.5-sonnet and Moonshot-v1-128k was 93.8%, with 83.3% of Claude-3.5-sonnet’s errors also present in Moonshot-v1-128k’s results.
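The risk differences and confidence intervals reported here are two-proportion comparisons; a minimal sketch of the calculation (assuming 12,814 extractions per model and a Wald-type normal approximation, which are our assumptions rather than the authors' stated method) reproduces the overall figures:

```python
from math import sqrt

def risk_difference(p1, p2, n1, n2, z=1.96):
    """Risk difference between two proportions with a Wald-type 95% CI."""
    rd = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd, rd - z * se, rd + z * se

# Overall extraction accuracy: Claude-3.5-sonnet vs. Moonshot-v1-128k
rd, lo, hi = risk_difference(0.962, 0.951, 12814, 12814)
print(f"RD = {rd:.1%} (95% CI: {lo:.1%} to {hi:.1%})")  # RD = 1.1% (95% CI: 0.6% to 1.6%)
```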

Fig. 1: Accuracy of Claude-3.5-sonnet and Moonshot-v1-128k in data extraction and risk-of-bias (ROB) assessment across domains in 107 RCTs. Claude-3.5-sonnet demonstrated higher overall accuracy in data extraction (96.2%, 95% CI: 95.8% to 96.5%) than Moonshot-v1-128k (95.1%, 95% CI: 94.7% to 95.5%), a statistically significant difference of 1.1% (p < 0.001). In ROB assessment, Claude-3.5-sonnet also achieved slightly higher accuracy (96.9% vs. 95.7%), though the difference was not statistically significant. The greatest domain-specific difference in data extraction accuracy was observed in the Baseline Characteristics domain, where Claude-3.5-sonnet outperformed Moonshot-v1-128k.
Investigators refined Moonshot-v1-128k’s extractions, achieving a corrected accuracy of 97.9% (95% CI: 97.7–98.2%), higher than the expected 95.3% for conventional methods (RD: 2.6%, 95% CI: 2.2–3.1%; p < 0.001; Fig. 2). The RD between LLM-assisted and LLM-only extractions was 2.8% (95% CI: 2.4–3.2%; p < 0.001; Table 1). Accuracy improvements were most notable in the Methods domain (+7.4%) and the Data and Analysis domain (+6.0%). Subgroup analyses showed that higher PDF recognizability positively affected Moonshot-v1-128k’s accuracy (p interaction = 0.023) but had no significant effect on LLM-assisted accuracy (p interaction = 0.100). Claude-3.5-sonnet achieved higher accuracy when extracting data from English RCTs than from Chinese RCTs (p interaction < 0.001).

Fig. 2: Accuracy (correct rate) and efficiency (time spent) of three methods for data extraction and risk-of-bias (ROB) assessment: conventional, LLM-only, and LLM-assisted. For data extraction, the conventional method had an estimated accuracy of 95.3% and took 86.9 min per RCT; the LLM-only method achieved 95.1% accuracy in only 96 s per RCT, while the LLM-assisted method had the highest accuracy at 97.9% and took 14.7 min per RCT. For ROB assessment, the conventional method had an estimated accuracy of 90.0% and took 10.4 min per RCT; the LLM-only method achieved 95.7% accuracy in only 42 s per RCT, while the LLM-assisted method had the highest accuracy at 97.3% and took 5.9 min per RCT. These results demonstrate that LLM-assisted methods can achieve higher accuracy than conventional methods while being substantially more efficient.
Both models conducted 1,070 ROB assessments. As shown in Fig. 1, Claude-3.5-sonnet achieved 96.9% accuracy (95% CI: 95.7–97.9%), slightly higher than Moonshot-v1-128k’s 95.7% (95% CI: 94.3–96.8%), though the difference was not statistically significant (RD: 1.2%, 95% CI: −0.4% to 2.8%). Moonshot-v1-128k’s lowest accuracy was in the Sequence generation domain (87.9%), while other domains ranged from 94.4% to 100.0%. Sensitivities for the Selective outcome reporting and Other bias domains were relatively low (0.50 and 0.40), with corresponding F-scores of 0.67 and 0.44, but other domains had F-scores between 0.97 and 1.00. Of the 46 incorrect assessments, 62.1% were due to missing supporting information, while 37.9% involved correct data extraction but erroneous judgments. Cohen’s kappa values indicated substantial to almost perfect agreement in most domains, except for Selective outcome reporting (0.66) and Other bias (0.42), likely due to high true negative rates (>93%). The inter-model agreement between Claude-3.5-sonnet and Moonshot-v1-128k was almost perfect (Cohen’s kappa = 0.88), with 66.7% of Claude-3.5-sonnet’s errors also present in Moonshot-v1-128k’s results.
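The F-scores and kappa values above follow from the underlying 2×2 confusion tables; a minimal sketch with illustrative counts (not the study's actual data) shows how a high true-negative rate can depress kappa even when raw agreement is high:

```python
def f_score(sensitivity, precision):
    """Harmonic mean of sensitivity (recall) and precision."""
    return 2 * precision * sensitivity / (precision + sensitivity)

def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 agreement table against the reference standard."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n                                            # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

print(f_score(0.50, 1.00))         # 0.67: e.g. sensitivity 0.50 with precision 1.00
print(cohens_kappa(3, 1, 2, 101))  # ~0.65: raw agreement is 97%, yet kappa is only moderate
```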
After refinement based on Moonshot-v1-128k’s assessments, the mean correctness rate of LLM-assisted ROB assessments increased to 97.3% (95% CI: 96.1–98.2%), significantly surpassing the expected 90.0% accuracy of conventional methods (RD: 7.3%, 95% CI: 6.2–8.3%; p < 0.001; Fig. 2). The RD between LLM-assisted and LLM-only assessments was 1.6% (95% CI: 0.0–3.2%; p = 0.05; Table 2), suggesting that human review corrected some errors and improved accuracy. The prevalence-adjusted bias-adjusted kappa (PABAK) among the four investigators was 0.88, indicating almost perfect agreement. The Sequence generation domain showed the greatest improvement in accuracy (+8.4%), and all errors in the Allocation sequence concealment domain were rectified, achieving a 100% correctness rate. Subgroup analysis revealed that Claude-3.5-sonnet achieved significantly higher accuracy in assessing ROB for English-language RCTs than for Chinese-language RCTs (p interaction < 0.001). Conversely, LLM-assisted assessments showed higher accuracy for RCTs published in Chinese (p interaction = 0.023), suggesting that the investigators’ native language influenced assessment accuracy.
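PABAK removes the influence of prevalence and rater bias by fixing the chance-agreement term; a minimal sketch of its standard definition (assuming binary judgments, which is our simplification rather than the study's stated setup):

```python
def pabak(observed_agreement, k=2):
    """Prevalence- and bias-adjusted kappa: chance agreement is fixed at 1/k."""
    return (k * observed_agreement - 1) / (k - 1)

# For k = 2, a PABAK of 0.88 corresponds to 94% raw agreement among raters
print(pabak(0.94))  # 0.88
```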
For both data extraction and ROB assessment, the LLMs delivered substantial time savings compared to conventional methods. Data extraction took an average of 96 s per RCT with Moonshot-v1-128k and 82 s with Claude-3.5-sonnet, while refinement extended Moonshot-v1-128k-assisted extractions to 14.7 min per RCT, still much faster than the 86.9 min required by conventional approaches. Similarly, ROB assessments averaged 42 s per RCT with Moonshot-v1-128k and 41 s with Claude-3.5-sonnet, with Moonshot-v1-128k-assisted assessments, including refinement, taking just 5.9 min per RCT compared to 10.4 min for conventional methods.
Overall, both Claude-3.5-sonnet and Moonshot-v1-128k demonstrated high accuracy, with LLM-assisted methods significantly outperforming conventional approaches in both accuracy and efficiency (Fig. 2). Claude-3.5-sonnet achieved slightly higher accuracy than Moonshot-v1-128k for data extraction (96.2% vs. 95.1%) and ROB assessment (96.9% vs. 95.7%), though the difference was statistically significant only for data extraction. Errors in both models often stemmed from failing to identify reported data, such as start/end dates and participant numbers, rather than misinterpreting extracted information. ROB assessment errors were most frequent in the Sequence generation domain, where inconsistent judgments arose despite correct justifications. For example, Moonshot-v1-128k accurately identified the randomization method in Mao, 2014 (Supplementary Note 2) but incorrectly classified it, suggesting challenges in applying rule-based criteria.
LLM-assisted methods proved particularly effective in addressing these issues. Human reviewers identified and corrected common error patterns, significantly improving accuracy, especially in the Methods domain (+7.4%) for data extraction and the Sequence generation domain (+8.4%) for ROB assessment. By addressing these recurring issues, reviewers not only enhanced the reliability of individual assessments but also provided insights into systematic weaknesses in LLM outputs. This process highlighted the critical role of human expertise in working with LLMs, as reviewers could identify specific areas needing improvement and ensure the conclusions were accurate and reliable.
Efficiency gains were substantial. For data extraction, time per RCT decreased from 86.9 min to 14.7 min, while ROB assessment times dropped from 10.4 min to 5.9 min. The combination of time savings and improved accuracy significantly enhances evidence synthesis, particularly in complex domains like CAM that demand specialized knowledge.
Subgroup analyses revealed that higher PDF recognizability improved Moonshot-v1-128k’s extraction accuracy (p interaction = 0.023), while Claude-3.5-sonnet extracted data more accurately from English-language RCTs than from Chinese-language ones (p interaction < 0.001). These subgroup differences may be attributed to variations in the models’ training datasets and underscore the need to consider document quality and language characteristics when applying LLMs to evidence synthesis.
This study aligns with previous research demonstrating high accuracy for LLMs such as Claude 2 and GPT-4 in data extraction when guided by structured prompts27,28,29. Our findings build on earlier work by optimizing prompts to enhance domain-specific judgment logic, stepwise reasoning, and few-shot learning, leading to significantly higher accuracy in ROB assessments than in prior studies30. Additionally, incorporating confidence estimates and justifications for each domain allowed investigators to identify errors more effectively.
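As an illustration of how such prompt elements might be organized (a hypothetical skeleton, not the validated prompt used in this study), a ROB-assessment prompt combining domain-specific judgment logic, stepwise reasoning, a few-shot slot, and per-domain confidence and justification could look like this:

```python
# Hypothetical prompt skeleton; field names and wording are illustrative only.
ROB_PROMPT_TEMPLATE = """\
You are assessing risk of bias in a randomized controlled trial of a CAM intervention.

Domain: {domain}
Judgment rule: {domain_criteria}
Worked example: {few_shot_example}

Work step by step:
1. Quote the passage(s) of the trial report relevant to this domain.
2. Apply the judgment rule to the quoted text.
3. Return a judgment (low / high / unclear risk), a justification citing the quotes,
   and a confidence estimate (0-100%).
"""

prompt = ROB_PROMPT_TEMPLATE.format(
    domain="Sequence generation",
    domain_criteria="Low risk if a truly random method (e.g. a random number table) is described.",
    few_shot_example="'Patients were randomized using a computer-generated list' -> low risk.",
)
```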
Strengths of this study include validated prompts, a diverse sample of RCTs, and the inclusion of less experienced reviewers to estimate practical effects. Limitations include potential language-dependent biases, as all reviewers were native Chinese speakers, and reliance on benchmark estimates for conventional methods, which may not fully reflect current practices across diverse settings. The included RCTs had first authors affiliated with institutions in 12 countries and regions, but 83 studies (77.6%) were from mainland China. Although we analyzed the publication language of the RCTs, the primary language background of the researchers may have been predominantly Chinese, which could limit the generalizability of these findings. While this study demonstrated the feasibility and effectiveness of LLM-assisted methods using Claude-3.5-sonnet, future research should explore other high-ranking models to ensure broader generalizability. Incorporating models from authoritative rankings or widely recognized sources could validate whether the approach is robust across different architectures and datasets, enhancing its applicability in diverse research contexts.