Patients and clinicians: LLMs achieve high QA accuracy but require human evaluation for clinical safety
Review: LLMs score highly on QA but need human-in-the-loop, real-world evaluation for safe clinical use.
Advancing the Evaluation of Large Language Models and Conversational Agents in Healthcare – A Comprehensive Review Uncovers Key Assessment Challenges and Emerging Strategies
As Artificial Intelligence (AI) technologies such as Large Language Models (LLMs) and Conversational Agents (CAs) rapidly enter healthcare, their potential to enhance clinical decision-making and patient support is substantial. Yet ensuring these AI tools are safe, accurate, and effective requires rigorous evaluation, something the field still grapples with. A new comprehensive literature review by Arkangel AI researchers sheds light on the current landscape of LLM and CA assessment in clinical settings, highlighting existing methods, their limitations, and promising paths forward.
The review finds that while question-answering datasets simulating clinical exams remain the de facto standard for evaluating medical knowledge and reasoning, they do not fully capture model safety, real-world efficacy, or user interaction quality. Human evaluation remains critical but is resource-intensive and limited in scale. The authors advocate combining quantitative automated metrics with qualitative human assessment, alongside innovative frameworks that emphasize real-world human-AI interaction and safety risk evaluation. Their analysis serves as a roadmap to guide the future development and deployment of these transformative AI tools.
Study Partnership & Context
This extensive review was conducted by a multidisciplinary team from Arkangel AI, including medical epidemiologists, biomedical engineers, and machine learning experts. The research draws on a broad spectrum of sources, including peer-reviewed journals, preprints, conference proceedings, and expert consensus statements published from 2015 through 2024. Insights from recent global health symposia and national congresses in Colombia add valuable context, reflecting real-world clinical priorities and user needs.
The setting is particularly important because it reflects a growing demand for trustworthy AI evaluation frameworks suited to dynamic clinical environments. Arkangel AI’s team emphasizes the need to bridge gaps between rapid AI advancements in natural language understanding and their practical assessment in healthcare—where patient safety, clinical accuracy, and ethical considerations are paramount.
Study Design and Methodology
The study used an unstructured, narrative literature review methodology, involving in-depth analysis of 40 relevant manuscripts covering diverse study designs such as systematic reviews, expert consensus papers, editorials, and technical reports. Databases searched included PubMed, arXiv, medRxiv, and Google Scholar, complemented by grey literature and AI model leaderboard data. The evaluation methods in scope encompassed both automated question-answering (QA) datasets and human evaluation (HE) frameworks.
The AI tools considered are LLMs and CAs trained on massive datasets spanning clinical textbooks, medical exams, research literature, and online medical dialogues. These models build on natural language processing architectures such as BioBERT, GPT-4, Med-PaLM 2, and other fine-tuned transformers to simulate clinical knowledge and reasoning.
Key Results
- Question-Answering Datasets:
- MedQA (USMLE-based) – GPT-4 with Medprompt achieved up to 90.2% accuracy, surpassing earlier models such as BioBERT-Large (42.0%) and the human passing score of roughly 60%.
- MedMCQA (Indian medical entrance exams) – Med-PaLM 2 reached 72.3% accuracy, versus a previous state of the art of 47% and a 50% human passing score.
- PubMedQA (Biomedical abstracts) – GPT-4 with Medprompt scored 81.6% accuracy, exceeding a human expert benchmark of 78%.
- MMLU Clinical Subset – Med-PaLM 2 scored between 84.4% and 95.8% across various medical specialties.
- Naturalistic datasets such as MeDiaQA assess conversational comprehension, extending evaluation beyond factual Q&A to dialogue understanding.
- Limitations of Automated QA Evaluation:
- Evaluation depends heavily on prompt specificity; models often hallucinate or reason incorrectly.
- Standard QA tasks do not measure communication quality or adaptability to diverse user inputs.
- Automated metrics such as BLEU and ROUGE correlate poorly with human judgment of clinical relevance and safety (a minimal overlap-metric sketch follows this list).
- Human Evaluation Frameworks:
- Human expert review remains the gold standard for assessing accuracy, relevance, and safety, though costly and logistically challenging.
- Studies using structured rating scales show that inter-rater agreement is often low (kappa < 0.5), underlining the complexity of evaluation.
- Large-scale human studies in which clinicians and nurses assess conversational agents report mixed results on bedside manner, clinical reasoning, and safety.
- Novel evaluation frameworks propose integrating human review with AI-assisted scoring to enhance scalability and consistency.
- Emerging Perspectives:
- Human-Interaction Evaluations (HIEs) focus on the socio-technical gap, measuring real-world use, safety risks, and task completion in clinical workflows.
- Frameworks addressing risk identification, contextual use, and human-AI collaboration dynamics are gaining traction to guide design and deployment.
- Approaches encouraging reflective human-AI deliberation show promise for complex, high-stakes clinical decisions.
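To make the overlap-metric limitation flagged above concrete, here is a minimal sketch of a ROUGE-1-style unigram-overlap score; the clinical sentences are invented for illustration. The point is that a response reversing a recommendation can score higher than a clinically sound paraphrase, because surface overlap ignores meaning.

```python
# Minimal sketch of a ROUGE-1-style unigram-overlap F1 score.
# The reference and candidate answers are invented for illustration only.
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, the core idea behind ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared word occurrences
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "start empiric antibiotics immediately and obtain blood cultures"
safe_paraphrase = "obtain blood cultures and begin empiric antibiotic therapy without delay"
unsafe_negation = "do not start empiric antibiotics immediately and obtain blood cultures"

print(f"safe paraphrase: {rouge1_f1(safe_paraphrase, reference):.2f}")  # ~0.56
print(f"unsafe negation: {rouge1_f1(unsafe_negation, reference):.2f}")  # ~0.89, despite reversing the advice
```

Mismatches of exactly this kind are why the review pairs automated scores with expert human review.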
Interpretation & Implications
This review paints a clear picture: while LLMs have achieved remarkable medical question-answering performance rivaling or exceeding human benchmarks, automated evaluation alone cannot ensure clinical safety or usability. The unpredictability of model reasoning, susceptibility to bias, and frequent hallucinations require layered assessment approaches.
For clinicians and health systems, these findings emphasize that deploying LLM-based tools demands robust, multi-dimensional evaluation encompassing not only knowledge accuracy but also communication style, interaction quality, and risk mitigation. Combining automated QA tests with carefully designed human reviews creates a balanced validation ecosystem. Additionally, incorporating real-world use scenarios and human interaction insights is vital to achieving trustworthy AI that genuinely supports clinical workflows and patient outcomes.
However, challenges remain: human evaluations are resource-intensive and prone to variability, while current QA datasets are limited in scope and may not represent diverse clinical contexts fully. Developing standardized, validated evaluation instruments and expanding practical trials in real healthcare settings will be key future steps.
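As a concrete illustration of that variability, and of the kappa < 0.5 agreement reported in the results above, the sketch below computes Cohen's kappa for two hypothetical raters labelling the same ten model answers as acceptable or not. The ratings are invented; the point is that 70% raw agreement can still correspond to a kappa well below 0.5 once chance agreement is accounted for.

```python
# Cohen's kappa for two raters scoring the same model answers as
# acceptable (1) or not acceptable (0). Ratings are invented for illustration.

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum((rater_a.count(k) / n) * (rater_b.count(k) / n) for k in labels)
    return (observed - expected) / (1 - expected)


rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]
rater_b = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]

# Raw agreement is 70%, but the chance-corrected kappa is only ~0.35.
print(round(cohens_kappa(rater_a, rater_b), 2))
```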
Deployment & Scalability
Although the reviewed models are not yet deployed as standalone clinical decision support systems, many are integrated into prototype conversational agents designed for healthcare providers and patients. The evaluation insights guide future deployment strategies that emphasize safety, interpretability, and usability.
Barriers identified include the high cost and time burden of rigorous human assessment, difficulty in scaling evaluations to cover vast clinical scenarios, and ensuring adaptability to varying health literacy levels and languages. To overcome these, innovative solutions such as AI-assisted human evaluation, continuous post-deployment monitoring, and modular evaluation frameworks tailored to specific use cases are proposed.
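One possible shape for such AI-assisted human evaluation is a triage loop in which inexpensive automated screens run on every model response and only flagged responses, plus a small random audit sample, are routed to clinician reviewers. The sketch below is an assumption about how this could look rather than the framework described in the review; the checks, thresholds, and field names are illustrative.

```python
# Hedged sketch of an AI-assisted evaluation triage loop: automated screens run
# on every response; only flagged items (plus a random audit sample) go to
# clinician reviewers. All checks and thresholds here are illustrative assumptions.
import random
from dataclasses import dataclass


@dataclass
class ModelResponse:
    question: str
    answer: str
    model_confidence: float  # assumed to be exposed by the serving stack


def needs_human_review(resp: ModelResponse, audit_rate: float = 0.05) -> bool:
    """Route a response to expert review if any automated screen fires."""
    low_confidence = resp.model_confidence < 0.6  # assumed threshold
    dosage_mentioned = any(t in resp.answer.lower() for t in ("mg", "dose", "dosage"))
    random_audit = random.random() < audit_rate   # ongoing spot checks
    return low_confidence or dosage_mentioned or random_audit


responses = [
    ModelResponse("What is first-line therapy for condition X?", "Drug A 500 mg twice daily.", 0.91),
    ModelResponse("Is symptom Y an emergency?", "It is usually benign.", 0.42),
]

for r in responses:
    route = "human review" if needs_human_review(r) else "automated pass"
    print(f"{r.question} -> {route}")
```

The appeal of this pattern is that scarce expert time concentrates on the responses most likely to carry risk, while the random audit sample keeps the automated screens themselves honest.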
Furthermore, given that LLMs are generalized technologies, evaluation approaches developed here have cross-cutting relevance for other medical specialties, languages, and emerging AI conversational systems. This flexibility supports wider scalability and long-term integration in diverse healthcare environments.
Conclusion & Next Steps
The Arkangel AI review underscores that evaluating large language models and conversational agents for healthcare is an evolving, complex endeavor. While existing question-answering benchmarks provide valuable insights into clinical knowledge and reasoning capabilities, they do not capture safety, interaction quality, or real-world efficacy comprehensively.
Human evaluation remains essential but must be augmented by scalable, objective metrics and frameworks focusing on human-AI collaboration and context-specific risks. Future research priorities include developing standardized evaluation protocols, adapting assessments to diverse clinical settings and users, and embedding ongoing evaluation in deployed AI tools to ensure continual safety and effectiveness.
As LLMs continue advancing and healthcare AI adoption grows, building robust, multi-faceted evaluation infrastructures will be critical to unlocking the full potential of conversational agents while safeguarding patients and clinicians alike.
For healthcare innovation leaders, this comprehensive synthesis provides a foundation for designing, validating, and deploying trustworthy AI conversational tools that meet the highest clinical standards.
Reference: Castano-Villegas N, Llano I, Martinez J, Jimenez D, Villa MC, Zea J, Velasquez L. "Approaches to Evaluating Large Language Models and Conversational Agents for Healthcare Applications." Arkangel AI, 2024. [Full text available on request.]