PANDORA AI Extracts EHR Data, Identifies COPD Risk in Patients with 98% PUMA Accuracy

PANDORA used GPT-4 to extract data from clinical notes and apply the PUMA scale: semantic extraction scores above 90% and PUMA scoring accuracy of 95-98%.

August 15, 2025 by Jose Zea

PANDORA AI: Automating Clinical Data Extraction and COPD Risk Scoring with Over 90% Semantic Accuracy

Extracting valuable clinical data from unstructured electronic health records (EHRs) remains a major challenge in healthcare, preventing researchers and clinicians from fully leveraging patient information. A novel artificial intelligence model, PANDORA, has demonstrated a strong ability to automatically extract structured data from free-text medical notes and apply validated clinical risk scores to drive diagnostic recommendations. In testing with real-world and synthetic patient data, PANDORA achieved semantic extraction scores above 90% and computed the PUMA COPD risk score correctly in up to 98% of cases.

This breakthrough highlights how generative AI models can transform inaccessible clinical narratives into actionable insights, a crucial step toward broader use of real-world data in clinical decision-making and research.

Addressing the Hidden Burden of Unstructured Clinical Data

Electronic health records house a wealth of patient information, yet much of it remains locked in free-text formats like physician notes or discharge summaries. These unstructured texts are notoriously difficult to analyze systematically, requiring time-consuming manual review or complicated data cleaning. Existing methods to extract clinical details often fall short in accuracy or scale, slowing down research progress and clinical workflows.

Meanwhile, timely and accurate risk stratification tools, particularly for chronic diseases like chronic obstructive pulmonary disease (COPD), are essential to optimize care but often depend on structured datasets not consistently available.

PANDORA leverages advances in large language models (LLMs), specifically the latest GPT-4 architecture, to bridge these gaps. Its dual-algorithm design first extracts relevant clinical variables from raw EHR texts, then applies a validated COPD risk scoring algorithm—the PUMA scale—to provide automated diagnostic guidance. This integration of natural language processing with clinical decision support represents a considerable leap forward in actionable data extraction.
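The extraction stage of such a two-stage design might look roughly like the sketch below. It is only an illustration under stated assumptions: the article does not publish PANDORA's prompt, schema, or model interface, so the field names, the `build_prompt` and `extract_variables` helpers, and the `fake_llm` stand-in are all hypothetical. The model is passed in as a plain function so the sketch runs without any API access.

```python
import json
from typing import Callable

# Hypothetical field names; the article does not publish PANDORA's schema.
FIELDS = ("sex", "age", "pack_years", "dyspnea",
          "chronic_phlegm", "chronic_cough", "prior_spirometry")


def build_prompt(note: str) -> str:
    """Ask the model for a JSON object with one key per field (null if absent)."""
    return (
        "Extract the following fields from the clinical note as a JSON object. "
        "Use null for anything the note does not state.\n"
        f"Fields: {', '.join(FIELDS)}\n"
        f"Note:\n{note}"
    )


def extract_variables(note: str, llm: Callable[[str], str]) -> dict:
    """Run the extraction stage; `llm` is any function mapping prompt -> reply text."""
    reply = llm(build_prompt(note))
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Keep only the expected keys so stray model output never reaches the scorer.
    return {field: data.get(field) for field in FIELDS}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return json.dumps({"sex": "male", "age": 63, "pack_years": 35,
                       "dyspnea": True, "unexpected_key": "ignored"})


variables = extract_variables("63M, 35 pack-years, short of breath.", fake_llm)
print(variables)
```

Fields the note does not mention come back as `None`, which lets a downstream scoring step distinguish "absent from the note" from "explicitly negative".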

Study Partnership and Real-World Relevance

This study was conducted by a multidisciplinary team at Arkangel AI, which collaborated closely with clinicians and data scientists experienced in pulmonology and informatics. The research utilized two critical data sources: the MIMIC-IV database, consisting of anonymized real hospital records from Beth Israel Deaconess Medical Center in Boston, and a synthetic dataset designed to mimic Colombian outpatient clinical records based on standardized histories.

These datasets represent both diverse real-world complexity and context-specific clinical scenarios, making PANDORA’s validation particularly relevant for healthcare systems with limited structured data repositories—common in many low- and middle-income settings.

Study Design and Methodology

The validation study analyzed two cohorts:

MIMIC-IV notes: Thousands of hospital discharge summaries and clinical notes from patients in Boston, USA, containing real, complex language and clinical variability.
Synthetic Colombian outpatient cases: Expert-designed simulated EHRs reflecting typical COPD-relevant patient encounters in Latin America.

PANDORA operates through two interconnected algorithms:

Extraction Algorithm: Processes unstructured EHR text and extracts relevant clinical variables necessary for scoring the PUMA COPD risk scale, such as smoking history, symptoms, and spirometry results.
Scoring Algorithm: Computes the PUMA risk score (range 0–9) and recommends COPD diagnostic evaluation if the score exceeds the threshold of 5.
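As a rough illustration of the scoring step, the sketch below sums points into a 0-9 score and compares it with a threshold. The item list and point values are an assumption based on the commonly cited PUMA questionnaire (sex, age band, pack-years, dyspnea, chronic phlegm, chronic cough, prior spirometry) and should be checked against the published instrument; the article itself names only smoking history, symptoms, and spirometry. The article also does not say whether a score of exactly 5 triggers a recommendation; this sketch assumes it does.

```python
def puma_score(*, sex: str, age: int, pack_years: float, dyspnea: bool,
               chronic_phlegm: bool, chronic_cough: bool,
               prior_spirometry: bool) -> int:
    """Sum the assumed PUMA item points; the result lies between 0 and 9."""
    score = 1 if sex == "male" else 0
    # Assumed age bands: under 50 -> 0, 50-59 -> 1, 60 and over -> 2.
    score += 2 if age >= 60 else 1 if age >= 50 else 0
    # Assumed smoking bands: under 20 pack-years -> 0, 20-30 -> 1, over 30 -> 2.
    score += 2 if pack_years > 30 else 1 if pack_years >= 20 else 0
    # One point for each symptom or history item that is present.
    score += int(dyspnea) + int(chronic_phlegm) + int(chronic_cough)
    score += int(prior_spirometry)
    return score


def recommend_evaluation(score: int, threshold: int = 5) -> bool:
    """Assumption: the cut-off is inclusive; use `>` if 5 itself should not trigger."""
    return score >= threshold


example = puma_score(sex="male", age=63, pack_years=35, dyspnea=True,
                     chronic_phlegm=False, chronic_cough=False,
                     prior_spirometry=False)
print(example, recommend_evaluation(example))  # 6 True
```

Keeping the threshold as a parameter makes it easy to test alternative cut-offs without touching the scoring logic.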

The model’s performance was assessed using three semantic metrics — BERTScore, SemanticScore, and RelevanceScore — capturing how well the AI-generated extractions matched reference answers. Additionally, human clinicians evaluated PANDORA’s accuracy in extracting data, applying the PUMA score, and making diagnostic recommendations.
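For readers unfamiliar with BERTScore, the sketch below shows its core idea, greedy cosine matching between token embeddings, on hand-made toy vectors. It is a simplification: the real metric uses contextual embeddings from a pretrained transformer and supports optional IDF weighting and baseline rescaling, none of which appear here. The function names are illustrative. SemanticScore and RelevanceScore are not defined in the article, so they are not sketched.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(candidate, reference):
    """Return (precision, recall, F1) from greedy best-match similarities.

    Precision averages each candidate token's best match in the reference;
    recall averages each reference token's best match in the candidate.
    Assumes the two averages are not both zero.
    """
    precision = sum(max(cosine(c, r) for r in reference)
                    for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate)
                 for r in reference) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-D "embeddings": the candidate covers two of three reference tokens exactly.
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p, r, f1 = greedy_match_f1(candidate_tokens, reference_tokens)
print(round(p, 3), round(r, 3), round(f1, 3))
```

In this toy case precision is perfect because every candidate token has an exact counterpart, while recall is lower because one reference token is only partially matched.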

Key Results

Semantic extraction metrics: Scores exceeded 90% across all metrics (BERTScore 0.911, SemanticScore 0.925, RelevanceScore 0.901), indicating strong understanding and coherence.
Data extraction accuracy: 100% for MIMIC-IV and 99% for synthetic cases per human evaluation.
PUMA scoring accuracy: Correct score calculation in 98% of MIMIC-IV cases and 95% of synthetic cases.
Diagnostic recommendation for COPD: 86% precision against MIMIC-IV standards and 100% accuracy on synthetic cases.
Sensitivity and specificity (MIMIC-IV): Sensitivity of 0.885 and specificity of 0.700 for COPD risk detection, reflecting a high true-positive rate but a notable false-positive rate (about 30% of patients without COPD were flagged), consistent with PUMA’s design as a screening tool.
Overall recommendation accuracy: Approximately 94-99% correctness in flagging COPD risk across both data sources.
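The results above mix several related but distinct metrics (precision, accuracy, sensitivity, specificity). The sketch below shows how each is derived from a confusion matrix. The counts are invented purely for illustration and are not the study's data, which the article does not report at this level of detail.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true COPD cases that were flagged
        "specificity": tn / (tn + fp),  # share of non-COPD cases correctly not flagged
        "precision": tp / (tp + fp),    # share of flagged cases that truly have COPD
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # share of all calls that were correct
    }


# Hypothetical counts, NOT taken from the PANDORA study.
metrics = screening_metrics(tp=40, fp=20, tn=30, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because a screening tool tolerates false positives in order to miss fewer true cases, sensitivity can be high while specificity and precision remain moderate, which is the pattern reported for MIMIC-IV.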

Interpretation and Clinical Implications

PANDORA’s demonstrated ability to extract structured data from narrative clinical notes with high accuracy unlocks previously inaccessible information for both clinical and research use. Automating the application of validated risk scores such as PUMA enables early identification of high-risk COPD patients without additional clinician burden.

For patients, this means potentially earlier diagnosis and intervention. For clinicians, the tool offers an efficient means to synthesize complex record data into actionable insights. For health systems, especially those lacking structured EHRs or facing resource constraints, PANDORA provides a scalable solution to harness their existing clinical data for quality improvement and epidemiologic insights.

That said, the moderate specificity reflects PUMA’s conservative screening design, which tends to flag more potential cases to reduce missed diagnoses. Future work could tailor thresholding or incorporate other scoring algorithms to improve precision in broader populations.

Deployment and Scalability Potential

While still in early validation phases, PANDORA’s reliance on advanced LLM architecture coupled with automated, end-to-end extraction and scoring pipelines makes it well-suited for integration into clinical workflows. Health institutions without structured data infrastructures could deploy it directly on free-text clinical documentation, instantly enabling risk stratification and decision support.

Challenges remain around ensuring data privacy, adapting to local clinical languages and documentation styles, and integrating AI outputs into electronic medical record interfaces safely. However, the modular design suggests adaptability beyond COPD and PUMA, potentially extending to other diseases where clinical data mostly resides in text.

Conclusion and Next Steps

PANDORA AI sets a new benchmark for extracting meaningful clinical data from free-text records and applying validated risk scores automatically. Its high semantic accuracy and robust performance across diverse datasets highlight the promise of generative AI to bridge a longstanding gap in real-world data utilization.

Future research should focus on expanding to additional clinical domains, refining specificity and thresholding for screening tools, and piloting real-world clinical deployment to assess impact on diagnostic accuracy and care pathways. By opening the “black box” of unstructured clinical text, solutions like PANDORA could significantly enhance precision medicine and health system analytics globally.

For those interested in further details, the preprint of this study is available through Arkangel AI’s release and associated references.