Vitruvius: Elevating Medical Question Answering with Conversational AI – Achieving 90% Accuracy on USMLE-Style Clinical Queries
Healthcare professionals face an ever-growing flood of clinical knowledge and research findings, making it increasingly difficult to stay current and efficiently access reliable medical information during patient care. In this context, Artificial Intelligence (AI) powered by Large Language Models (LLMs) promises to revolutionize how clinicians retrieve and interpret evidence-based knowledge in real time. The recent study on Vitruvius, a novel conversational agent, brings this promise closer to clinical reality by demonstrating state-of-the-art capabilities in understanding and accurately responding to complex medical questions.
Vitruvius combines a multi-LLM system with real-time retrieval of trusted medical literature, answering USMLE-style clinical questions with over 90% accuracy. This performance exceeds that of widely used medical LLMs and shows the potential of AI-powered assistants to enhance clinical decision support and research accessibility, while still respecting the critical role of human expertise.
Introducing Vitruvius: Conversational AI Designed for Clinical Knowledge Retrieval
In healthcare, timely access to accurate, evidence-based information is paramount. Clinicians frequently consult medical guidelines, research articles, and best practices to guide patient management. However, existing methods—manual database searches, static clinical support tools—are often time-consuming and fail to integrate the breadth of available evidence dynamically. They also lack interactive conversation ability, limiting ease of use during busy clinical workflows.
Vitruvius addresses these challenges as a conversational agent powered by five specialized LLMs that collectively manage information retrieval, synthesis, reasoning, and response generation. The system actively queries sources such as PubMed and Google to retrieve relevant clinical guidelines and research articles. By automatically classifying query types (clinical reference, research, diagnostic, or general information), it adapts its search strategy to produce precise, evidence-backed answers in multiple languages.
Tested against the MedQA dataset—a benchmark comprising over 1,200 U.S. medical licensing exam questions—Vitruvius' latest iteration achieved an outstanding 90.26% accuracy, outperforming prominent models such as GPT-4o and Med-PaLM 2. Its robust performance highlights its potential as a powerful, real-time assistant for clinical knowledge discovery and evidence-based medicine.
Study Partnership and Context
This study was conducted by the Arkangel AI team in Bogotá, Colombia, a company specializing in healthcare-focused AI applications. The setting is particularly significant given the global demand for innovative solutions that bridge the gap between rapidly evolving medical evidence and clinicians’ workflow constraints, including regions where access to updated clinical knowledge remains challenging.
By targeting a broad spectrum of healthcare queries and incorporating multilingual capabilities (English, Spanish, Portuguese), Vitruvius addresses diverse patient populations and healthcare systems. This inclusivity enhances its potential for deployment in varying clinical environments, including resource-limited settings.
Study Design and Methodology
The evaluation employed the MedQA dataset, specifically the 1,273-question test set presenting USMLE-style multiple-choice queries covering a wide range of specialties such as pediatrics, endocrinology, and oncology. Questions vary in complexity, including those requiring single-step reasoning and others involving multi-step clinical case analyses.
Vitruvius comprises five large language models working in concert through a Retrieval-Augmented Generation (RAG) framework:
- Orchestrator (LLM 1): Classifies the question type and directs it to specialized workflows.
- Query Generator (LLM 2): Creates precise search strategies tailored to the query’s semantic intent.
- Summarizer (LLM 3): Extracts and condenses key information from retrieved texts.
- Answer Generator (LLM 4): Produces multiple candidate answers based on retrieved context and the model’s intrinsic knowledge.
- Judge (LLM 5): Evaluates candidate responses to synthesize a final consolidated answer.
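The five-stage pipeline above can be sketched as a simple chain of functions. This is an illustrative toy, not the Vitruvius implementation (which is not public): every function name, classification rule, and stub response below is a hypothetical stand-in for a call to one of the five LLMs.

```python
# Toy sketch of the five-stage multi-LLM pipeline described above.
# All names and routing rules are hypothetical stand-ins for LLM calls.

def orchestrate(question: str) -> str:
    """LLM 1 (Orchestrator): classify the query type to pick a workflow."""
    q = question.lower()
    if "diagnos" in q:
        return "diagnostic"
    if "trial" in q or "study" in q:
        return "research"
    if "guideline" in q or "first-line" in q:
        return "clinical_reference"
    return "general"

def generate_query(question: str, query_type: str) -> str:
    """LLM 2 (Query Generator): turn the question into a search strategy."""
    return f"{query_type}: {question}"

def summarize(documents: list[str]) -> str:
    """LLM 3 (Summarizer): condense retrieved texts into key evidence."""
    return " | ".join(doc[:60] for doc in documents)

def generate_candidates(question: str, evidence: str) -> list[str]:
    """LLM 4 (Answer Generator): propose answers from evidence + prior knowledge."""
    return [f"Answer grounded in: {evidence}", "Answer from model knowledge alone"]

def judge(candidates: list[str]) -> str:
    """LLM 5 (Judge): consolidate a final answer (here: pick the first)."""
    return candidates[0]

def answer_question(question: str, retrieved_docs: list[str]) -> str:
    query_type = orchestrate(question)
    _search = generate_query(question, query_type)  # would drive retrieval
    evidence = summarize(retrieved_docs)
    return judge(generate_candidates(question, evidence))
```

The modularity is the point: each stage can be swapped or retrained independently, which is what the study credits for the gains between versions V1 and V3.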
This modular architecture allows the system to continuously refine answers by integrating background knowledge with curated, up-to-date evidence from trusted databases. Searches pull from more than 37 million biomedical references, primarily using Google and PubMed APIs, ensuring responses are grounded in authoritative clinical guidelines and scientific publications.
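As a concrete illustration of the retrieval step, a PubMed search can be issued through NCBI's public E-utilities `esearch` endpoint. The snippet below only builds the request URL; the paper does not specify how Vitruvius calls PubMed, so this is one plausible mechanism, not the system's actual retrieval code.

```python
# Build an NCBI E-utilities esearch URL for a PubMed query.
# Assumption: Vitruvius' actual PubMed integration is unspecified; this
# shows the standard public endpoint a RAG retriever could use.
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pubmed_search(term: str, retmax: int = 5) -> str:
    """Return an esearch URL that asks PubMed for the top `retmax` PMIDs."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"
```

Fetching that URL returns a JSON list of PubMed IDs, whose abstracts a summarizer stage could then condense into the evidence context for answer generation.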
The system supports dynamic, conversational interaction via a user-friendly interface that accommodates follow-up questions, enabling clinicians to guide the query process iteratively. Answers are provided with cited references, enhancing transparency and trustworthiness.
Key Results
- Accuracy: Final Vitruvius version (V3) scored 90.26% accuracy on the full 1,273-question MedQA test set.
- Phase-One Screening: On a subset of 288 questions, version 3 achieved 93.06% accuracy, outperforming earlier versions (V1: 85.76%, V2: 90.28%).
- Consistency Across Classes: Precision, recall, and F1-scores ranged narrowly between approximately 88% and 92% across all answer classes (A, B, C, D), indicating balanced performance.
- Agreement Metrics: Cohen’s Kappa coefficient of 86.96% demonstrated high concordance with ground truth answers.
- Benchmarked Superiority: Outperformed GPT-4o (87.51% accuracy) and Med-PaLM 2 (85.4%) evaluated on the same dataset.
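The agreement metrics above are standard and easy to reproduce. The sketch below computes accuracy and Cohen's kappa, which corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The answer letters here are made-up toy data, not the study's results.

```python
# Accuracy and Cohen's kappa for multiple-choice predictions.
# Toy data only; the study's actual per-question answers are not public.
from collections import Counter

def accuracy(preds: list[str], truth: list[str]) -> float:
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def cohens_kappa(preds: list[str], truth: list[str]) -> float:
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(truth)
    p_o = accuracy(preds, truth)                     # observed agreement
    pred_counts, truth_counts = Counter(preds), Counter(truth)
    p_e = sum(pred_counts[c] * truth_counts[c]       # chance agreement
              for c in set(preds) | set(truth)) / n**2
    return (p_o - p_e) / (1 - p_e)

preds = list("ABCDABCDAB")   # model's chosen options (toy)
truth = list("ABCDABCCAA")   # ground-truth options (toy)
print(round(cohens_kappa(preds, truth), 3))  # 0.73
```

A kappa of 86.96% on a four-option task, as reported for Vitruvius, indicates agreement with the answer key far above what chance guessing would produce.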
Qualitative error analysis revealed occasional mistakes, particularly in questions involving ethical nuances, human behavior interpretation, or cases reliant on image analysis—limitations attributed both to dataset constraints and current AI reasoning challenges.
Interpretation & Clinical Implications
Vitruvius’ ability to combine evidence retrieval with nuanced reasoning represents a meaningful advancement in AI-assisted clinical support. For busy healthcare providers, it offers rapid, conversational access to precise, evidence-backed answers without the need to manually sift through multiple resources.
This can enhance productivity, reduce cognitive overload, and improve the consistency of clinical decision-making. It may be particularly valuable in settings where continuous knowledge updates are difficult to maintain or where specialized expertise is scarce.
However, critical human oversight remains essential. The study emphasizes that despite high accuracy, Vitruvius should act as an adjunct, not a replacement for medical judgment—especially given some error types that could potentially impact patient safety if uncorrected.
Integrating such AI tools must therefore prioritize clinician education, transparent AI reasoning, and clear boundaries regarding autonomous decision-making to maximize benefit while minimizing risk.
Deployment & Scalability
Currently deployed via a web-based conversational interface through Arkangel AI’s platform, Vitruvius is designed for real-time, multilingual clinical use. Its modular design facilitates updates, including incorporation of new medical knowledge and training on additional datasets.
Challenges to broader adoption include ensuring seamless integration within electronic health record (EHR) systems and clinical workflows, managing data privacy and security, and addressing language and cultural context adaptation.
Future deployment strategies could leverage customization to specific specialties or healthcare settings and expand interactions via voice assistants or mobile platforms to maximize utility.
Conclusion & Next Steps
Vitruvius marks a significant step forward in AI-driven medical question answering, combining large language model power with real-time evidence retrieval to achieve state-of-the-art accuracy on challenging licensing-exam questions. It offers a compelling prototype for augmenting clinicians' access to relevant knowledge efficiently and reliably.
Future research should focus on prospective clinical trials assessing impact on workflow efficiency and patient outcomes, extending validation to diverse clinical questions and real-world datasets, and enhancing model transparency and safety features. Engaging frontline clinicians in iterative design will be key to successful implementation.
As AI-powered agents like Vitruvius evolve, they are poised to become indispensable partners in evidence-based medicine—accelerating knowledge-to-practice translation while complementing the indispensable role of human clinical expertise.
References & Study Details
Study Title: Vitruvius: A Conversational Agent for Real-Time Evidence-Based Medical Question Answering
Authors & Affiliations: Maria Camila Villa, Isabella Llano, Natalia Castano-Villegas, Julian Martinez, Maria Fernanda Guevara, Jose Zea, Laura Velásquez; Arkangel AI, Bogotá, Colombia
Key Objective: Develop and evaluate an LLM-based conversational agent specialized in evidence-based medical question answering.
Study Size & Setting: Evaluation on 1,273 clinical questions from the USMLE MedQA dataset.
Time Period: Manuscript posted October 2024.
Study Design: Retrospective evaluation of AI model performance against a validated benchmark dataset.
AI Model Type & Data Sources: Multi-LLM ensemble using GPT-family architectures with integrated PubMed and Google APIs for literature retrieval.
Primary Outcomes: Accuracy in selecting correct answers, precision, recall, F1-scores, and Cohen’s Kappa agreement.
Main Quantitative Results: Version 3 accuracy 90.26%, Cohen’s Kappa 86.96%, outperforming competing models.
Key Implications: Demonstrates feasibility and benefits of real-time, conversational, evidence-based AI support for healthcare professionals.
Deployment Context: Accessible via Arkangel AI platform, supports English, Spanish, Portuguese; designed for research and clinical assistance, not autonomous decision-making.
Link to Paper: https://doi.org/10.1101/2024.10.03.24314861