Arkangel AI: Harnessing Large Language Models for Real-Time, Evidence-Based Medical Question Answering with 90% Accuracy
In a healthcare landscape overwhelmed by rapidly expanding medical knowledge, clinicians often struggle to access timely, relevant, and trustworthy information during decision-making. Traditional methods of medical question answering rely heavily on static databases or individual expertise, which can limit speed and comprehensiveness. Addressing this challenge, Arkangel AI introduces a conversational agent powered by multiple large language models (LLMs) designed to deliver real-time, evidence-based answers to complex medical queries with remarkable accuracy.
By leveraging an innovative multi-LLM architecture combined with real-time information retrieval from trusted sources like PubMed and Google, Arkangel AI achieves a notable 90.26% accuracy on the rigorous MedQA benchmark—surpassing many current state-of-the-art medical LLMs. This breakthrough highlights the potential of AI-assisted research assistants to augment clinical reasoning, streamline workflows, and improve access to vetted medical knowledge.
Introducing Arkangel AI: A Colombian Innovation at the Frontier of Medical AI
Developed by a multidisciplinary team at Arkangel AI in Bogotá, Colombia, the system reflects a growing global push for AI solutions that meet regionally relevant healthcare needs. Colombia, like many countries, faces disparities in access to updated clinical guidelines and scientific literature, making the rapid synthesis of trustworthy medical information vital.
The development team aimed to build a tool that not only processes complex clinical and research queries but also supports multilingual interactions in English, Spanish, and Portuguese — crucial for Latin American clinicians and researchers. This culturally and linguistically tailored approach ensures wider applicability and usability in diverse clinical environments.
Study Design and Methodology: Multi-LLM Architecture Meets Rigorous Validation
The study evaluated Arkangel AI's performance on two extensive and well-recognized medical question answering datasets: MedQA (1,273 test questions from USMLE examinations) and PubMedQA (500 human-evaluated biomedical research questions). The questions spanned diverse medical specialties and subfields, with data collected up to early 2025.
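For readers who want a concrete sense of these benchmarks, the minimal sketch below loads publicly available versions of both from the Hugging Face Hub. The dataset identifiers, configurations, and field names are assumptions chosen for illustration; they are not necessarily the exact resources or splits used by the Arkangel AI team.

```python
# Minimal sketch: inspecting public versions of the two benchmarks.
# Dataset IDs, configs, and field names are illustrative assumptions,
# not the exact resources used in the study.
from datasets import load_dataset

# MedQA (USMLE-style multiple choice); the paper reports 1,273 test questions.
medqa = load_dataset("GBaker/MedQA-USMLE-4-options", split="test")

# PubMedQA; the paper uses 500 human-evaluated research questions.
pubmedqa = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")

print(len(medqa), "MedQA test questions")
print(medqa[0]["question"])    # question stem
print(medqa[0]["options"])     # answer choices A-D
print(pubmedqa[0]["question"], pubmedqa[0]["final_decision"])  # yes/no/maybe label
```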
Arkangel AI’s architecture integrates five interconnected LLMs operating within a Retrieval-Augmented Generation (RAG) framework. This system dynamically retrieves relevant documents via Google and PubMed APIs, then processes and summarizes the information to produce contextually accurate answers. Specifically:
- LLMs 1 and 2: Classify the query type and optimize the search strategy.
- RAG Module: Retrieves the top ten most relevant documents per query, filtered for quality and safety.
- LLMs 3 and 4: Summarize retrieved content and generate multiple candidate answers.
- LLM 5: Acts as an internal “judge,” reasoning through generated responses to choose the most accurate.
The system classifies queries into four workflows—Clinical Reference, Clinical Research, Diagnostic, and General Information—to tailor retrieval and response approaches efficiently.
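To make the five-model pipeline more concrete, here is a minimal sketch of the same pattern: one model routes the query into a workflow, a retrieval step pulls abstracts from PubMed via the NCBI E-utilities API, summarizer and generator models draft candidate answers, and a judge model selects the final response. The `chat()` helper, prompts, and model choice are assumptions for illustration only; the paper does not disclose Arkangel AI's actual prompts, models, or retrieval filters.

```python
# Illustrative sketch of a multi-LLM RAG pipeline in the spirit of Arkangel AI.
# Prompts, model name, and helper functions are assumptions, not the real system.
import requests
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

WORKFLOWS = ["Clinical Reference", "Clinical Research", "Diagnostic", "General Information"]

def chat(prompt: str) -> str:
    """Single-turn call to an LLM; any chat-completion backend would do."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def classify_query(question: str) -> str:
    """LLMs 1-2: decide which workflow the question belongs to."""
    label = chat(f"Classify this medical question into one of {WORKFLOWS}. "
                 f"Answer with the label only.\n\nQuestion: {question}")
    return label.strip()

def retrieve_pubmed(question: str, k: int = 10) -> str:
    """RAG module: fetch abstracts for the top-k PubMed hits via NCBI E-utilities."""
    ids = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={"db": "pubmed", "term": question, "retmax": k, "retmode": "json"},
        timeout=30,
    ).json()["esearchresult"]["idlist"]
    return requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
        params={"db": "pubmed", "id": ",".join(ids), "rettype": "abstract", "retmode": "text"},
        timeout=30,
    ).text

def answer(question: str, n_candidates: int = 3) -> str:
    workflow = classify_query(question)      # LLMs 1-2
    evidence = retrieve_pubmed(question)     # RAG module
    # LLM 3: summarize the retrieved abstracts for the chosen workflow.
    summary = chat(f"Workflow: {workflow}. Summarize the evidence relevant to "
                   f"'{question}' from these abstracts:\n\n{evidence[:8000]}")
    # LLM 4: draft several candidate answers grounded in the summary.
    candidates = [
        chat(f"Evidence summary:\n{summary}\n\nAnswer concisely: {question}")
        for _ in range(n_candidates)
    ]
    # LLM 5: judge the candidates and return the most accurate one.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return chat(f"Question: {question}\n\nCandidates:\n{numbered}\n\n"
                "Select the single most accurate candidate and restate it.")

print(answer("What is the first-line treatment for uncomplicated community-acquired pneumonia?"))
```

The key design choice mirrored here is separating generation from judging: several independently drafted candidates give the final "judge" call something to compare, which is the mechanism the authors credit with reducing inconsistent outputs.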
Key Results: Quantitative Evidence of Performance Excellence
- Accuracy: 90.26% on the MedQA test set, outperforming leading LLM benchmarks such as GPT-4o (87.51%) and Med-PaLM 2 (85.4%).
- Cohen’s Kappa: 86.96%, indicating near-perfect agreement with reference answers.
- Consistency: High sensitivity, precision, and F1-scores above 89% across varied question classes with no statistical bias.
- Workflow Classification Accuracy: 94.5% overall, with highest accuracy in Clinical Research (100%) and Diagnostic (98.2%) workflows.
- Retrieval Metrics: Retrieved 80.2% of expected articles in PubMedQA, with context precision at 55% in MedQA and response relevance exceeding 82% in PubMedQA.
- Response Faithfulness: Over 57% of answers in MedQA were supported directly by retrieved sources, while some correct answers drew on the LLMs' background knowledge, indicating effective hybrid reasoning.
- Efficiency: Average response time was approximately 2.6 minutes per query, practical for clinical and research workflows.
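For readers who want to reproduce this style of scoring, the snippet below shows how accuracy, Cohen's kappa, macro F1, and a per-question retrieval recall could be computed with scikit-learn and plain Python. The toy labels and article IDs are placeholders, not data from the study.

```python
# Minimal sketch: scoring MedQA-style multiple-choice predictions.
# The toy labels below are placeholders, not data from the study.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

reference   = ["A", "C", "B", "D", "A", "B"]   # gold answer letters
predictions = ["A", "C", "B", "A", "A", "B"]   # model's chosen letters

print(f"Accuracy:      {accuracy_score(reference, predictions):.4f}")
print(f"Cohen's kappa: {cohen_kappa_score(reference, predictions):.4f}")
print(f"Macro F1:      {f1_score(reference, predictions, average='macro'):.4f}")

# Retrieval recall (share of expected articles actually retrieved) per question:
expected  = {"PMID:111", "PMID:222", "PMID:333"}
retrieved = {"PMID:111", "PMID:333", "PMID:999"}
print(f"Retrieval recall: {len(expected & retrieved) / len(expected):.2f}")
```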
Clinical Interpretation and Implications
The demonstrated high accuracy and reliability position Arkangel AI as a valuable tool to augment decision-making in clinical and research settings. Its ability to retrieve, synthesize, and present evidence-based answers within minutes can help clinicians stay abreast of evolving guidelines and research, particularly in resource-constrained environments.
By classifying queries into distinct workflows, the system adapts its search and reasoning strategy to the specific clinical context, potentially improving relevance and trustworthiness. The multi-LLM "judging" mechanism also effectively mitigates common challenges with hallucinated or inconsistent AI outputs.
Nonetheless, the authors prudently emphasize that Arkangel AI serves as a decision support tool—not a replacement for clinical judgment. Continued improvement in prompt engineering and external validation with real-world clinician feedback are key next steps to maximize safety and utility.
Deployment Potential and Scalability
Arkangel AI is accessible through a conversational platform supporting English, Spanish, and Portuguese, facilitating adoption across Latin America and other multilingual settings. The modular API integration leverages existing, trusted information resources, enhancing transparency and auditability.
Barriers to clinical deployment include integration into electronic health records, ensuring patient privacy, and clinician training in effective prompt formulation. The Arkangel AI team has begun addressing these by providing educational resources and considering workflow embedding strategies.
The adaptable architecture is well-suited for expansion into other medical domains and geographies, provided relevant localized databases and guidelines are incorporated. Future iterations may include multimodal inputs such as images or lab data to enhance diagnostic capacities.
Conclusion and Next Steps
Arkangel AI embodies a significant advance in AI-powered medical question-answering, combining state-of-the-art LLM capabilities with rigorous real-time retrieval and multi-step reasoning. Its impressive accuracy underscores the potential for conversational agents to become integral clinical support tools in the near future.
Ongoing research priorities include external clinical validation, enhancement of workflow classification fidelity, reduction of reliance on baseline LLM knowledge alone, and exploration of integration pathways into routine care. As AI literacy grows among healthcare providers, tools like Arkangel AI can help bridge knowledge gaps, foster evidence-based practice, and ultimately improve patient outcomes.
For healthcare innovation leaders, Arkangel AI offers a compelling example of how tailored, multi-LLM systems can revolutionize information access and clinical decision support.
Reference
Villa MC, Castano-Villegas N, Llano I, Martinez J, Guevara MF, Zea J, Velásquez L. Arkangel AI: A conversational agent for real-time, evidence-based medical question-answering. Intelligence-Based Medicine. 2025;12:100274. https://doi.org/10.1016/j.ibmed.2025.100274