PANDORA LLM Automates COPD Risk Detection with Near-Perfect Extraction and 94% PUMA Accuracy

The LLM auto-extracted features from ICU and outpatient notes (~100% accuracy) and applied the PUMA score: ~94% scoring accuracy and up to 100% sensitivity

by Jose Zea

PANDORA: Harnessing Large Language Models to Automate Clinical Data Extraction and COPD Risk Assessment with Near-Perfect Accuracy

In modern healthcare, vast amounts of valuable patient information remain trapped in unstructured clinical notes, limiting their effective use for diagnosis, risk stratification, and research. The AI system PANDORA leverages advanced Large Language Models (LLMs) to automatically extract key clinical features from raw medical documents, apply validated risk scoring, and deliver recommendations with remarkable accuracy, demonstrated here with Chronic Obstructive Pulmonary Disease (COPD) case finding.

In this landmark study, researchers from Arkangel AI in Bogotá, Colombia, validated PANDORA's performance using complex ICU discharge notes from the MIMIC-IV database and synthetically generated outpatient cases. The system achieved near-perfect data extraction accuracy (100% on MIMIC notes and 99.6% on synthetic charts), applied the PUMA COPD screening score with roughly 94% accuracy, and ultimately identified patients at risk for COPD with sensitivities up to 100%. These results showcase PANDORA's potential to transform unstructured clinical text into actionable insights, closing a critical gap in real-world healthcare data utilization.

Introduction: Unlocking Knowledge from Unstructured Clinical Text

Clinical records are the backbone of patient care, yet up to 80% of the data they contain exists in unstructured formats such as physician notes, discharge summaries, and narrative reports. This free-text information holds crucial insights, including symptom history, smoking status, and test results, that have historically been labor-intensive to extract and incorporate into decision-making workflows. The bottleneck results in missed opportunities for early diagnosis and population health management, and it biases research through incomplete datasets.

Traditional methods for harnessing unstructured data have relied on manual chart reviews or rule-based natural language processing (NLP) systems with limited adaptability. Meanwhile, recent advances in LLMs, which interpret context and medical terminology with a high degree of fluency, have opened a new frontier for scalable, accurate information extraction directly from raw clinical text.

Enter PANDORA: a modular AI framework composed of two synchronized LLM agents designed to extract relevant clinical features from unstructured electronic health records (EHRs) and automatically implement clinical risk scores based on these features. In this study, the focus was on COPD risk assessment through the established PUMA screening tool, testing how well PANDORA can replicate expert-level data extraction and scoring accuracy using both real-world and synthetic clinical data.

Study Context and Partnership

This study was conducted by the Arkangel AI team in Bogotá, Colombia, reflecting a growing effort in Latin America to harness AI tools tailored to regional healthcare needs. The decision to focus on COPD stems from its high prevalence worldwide—particularly in Latin America—with substantial rates of underdiagnosis, estimated as high as 89%. Early and accurate identification of COPD risk remains a pressing unmet need in outpatient and critical care settings alike.

The inclusion of the MIMIC-IV dataset, comprising detailed ICU discharge notes from a major U.S. academic medical center, ensures that the model was tested on complex, real-world clinical documentation representative of severe illness cases. Complementing this, synthetically generated outpatient charts modeled after Colombian primary care consultations expanded evaluation to more typical, diverse clinical scenarios.

Study Design and Methodology

The PANDORA system consists of two core phases, sketched in code after this list:

  • Extraction Phase: An LLM-based module processes unstructured EHR text to extract predefined clinical features relevant to COPD risk, such as smoking history, symptom chronicity, and prior diagnoses.
  • Scoring and Recommendation Phase: Using the extracted data, a second LLM applies the PUMA COPD score—a validated 7-criterion clinical calculator determining the need for spirometry testing—and generates a binary COPD risk classification (positive if score ≥5).
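
To make the two-phase design concrete, below is a minimal Python sketch of how an extraction agent and a scoring step could be chained. The `call_llm` helper, the prompt wording, and the feature names are illustrative assumptions rather than the authors' implementation; in the paper the scoring is performed by a second LLM agent, whereas the deterministic function here only illustrates the PUMA logic, with point weights paraphrased from the published questionnaire (verify against the original instrument before reuse).

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM backend is used (hypothetical helper)."""
    raise NotImplementedError

# Phase 1: extraction agent pulls PUMA-relevant features out of free text.
EXTRACTION_PROMPT = """Extract the following fields from the clinical note and
answer strictly as JSON: sex, age, pack_years, dyspnea, phlegm, cough,
prior_spirometry, prior_copd_diagnosis.

Note:
{note}
"""

def extract_features(note: str) -> dict:
    return json.loads(call_llm(EXTRACTION_PROMPT.format(note=note)))

# Phase 2: apply the PUMA score to the extracted features.
def puma_score(f: dict) -> int:
    """PUMA point weights paraphrased from the published questionnaire."""
    score = 1 if f["sex"] == "male" else 0
    score += 0 if f["age"] < 50 else (1 if f["age"] < 60 else 2)
    score += 0 if f["pack_years"] < 20 else (1 if f["pack_years"] <= 30 else 2)
    score += sum(1 for k in ("dyspnea", "phlegm", "cough", "prior_spirometry") if f[k])
    return score

def assess(note: str) -> dict:
    features = extract_features(note)
    score = puma_score(features)
    return {"features": features,
            "puma_score": score,
            "copd_risk_positive": score >= 5}  # threshold reported in the study
```

In a deployment, something like `assess` would run once per note, with its output routed into the recommendation step.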

Data sources included:

  • MIMIC-IV database: 615 evaluated QA pairs from discharge notes spanning 2002 to 2019 within ICU patient records at Beth Israel Deaconess Medical Center.
  • Synthetic outpatient clinical charts: 700 QA pairs generated with GPT technology following Colombian clinical documentation standards to simulate diverse COPD differential diagnoses (an illustrative QA-pair record is sketched after this list).
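
The paper does not reproduce the exact layout of these QA pairs, but conceptually each one ties a question about a single clinical feature to an expert-validated gold answer against which the model's answer is checked. A hypothetical record and the corresponding extraction-accuracy computation might look like this (field names are assumptions for illustration):

```python
# Hypothetical structure of one evaluation QA pair (field names assumed).
qa_pair = {
    "note_id": "mimic-iv-discharge-0001",
    "question": "Does the patient have a documented smoking history?",
    "gold_answer": "yes",    # expert-validated reference
    "model_answer": "yes",   # produced by the extraction agent
}

def extraction_accuracy(pairs: list[dict]) -> float:
    """Fraction of QA pairs whose model answer matches the gold answer."""
    return sum(p["model_answer"] == p["gold_answer"] for p in pairs) / len(pairs)
```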

Evaluation metrics focused on:

  • Extraction Accuracy: Correct identification of clinical features from unstructured text, benchmarked against expert-validated QA pairs.
  • Scoring Accuracy: Correct calculation of the PUMA COPD risk score from extracted data.
  • Recommendation Performance: Sensitivity, specificity, precision, accuracy, F1 score, and Cohen’s Kappa for COPD risk classification (standard definitions; see the helper sketched below).
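
For reference, the recommendation-phase figures reported below follow the standard definitions computed from a 2x2 confusion matrix; this small helper (not code from the paper) makes the formulas explicit:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / n
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "cohens_kappa": kappa}
```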

Key Results

  • Extraction Phase:
    • 100% accuracy for MIMIC discharge notes (615 QA pairs).
    • 99.6% accuracy for synthetically generated outpatient charts (700 QA pairs).
  • Scoring Phase:
    • 94.5% accuracy calculating PUMA scores on MIMIC data.
    • 94.1% accuracy on synthetic case scores.
  • Recommendation Phase (COPD risk classification):
    • Sensitivity: 85.5% (MIMIC with history of COPD considered), 19.4% (MIMIC without history), and 100% (synthetic cases).
    • Specificity: 70% (MIMIC with history), 92.5% (MIMIC without history), but only 20% (synthetic cases).
    • Overall accuracy: 79.4% (MIMIC with history), 48.0% (MIMIC without history), and 36.0% (synthetic cases).
    • The inclusion of prior COPD diagnosis as a feature improved sensitivity dramatically (by roughly 66 percentage points) but reduced specificity by 22.5 percentage points.

Interpretation and Clinical Implications

PANDORA’s ability to achieve near-perfect extraction accuracy across highly heterogeneous, unstructured clinical texts is a significant advance, demonstrating that large language models can reliably identify essential clinical elements without preprocessing or structured input. This feature alone could drastically reduce the manual effort traditionally required for EHR data abstraction.

More importantly, the system’s integration with a validated COPD screening tool (PUMA) and its high accuracy in replicating risk stratification signals a new era where AI can seamlessly bridge text extraction with evidence-based clinical decision support. In practice, this means clinicians could receive automated alerts on patients at risk of COPD during routine chart review, facilitating timely spirometry testing and earlier diagnosis.

The observed differences in specificity between the ICU-heavy MIMIC dataset and the synthetic outpatient charts highlight the importance of contextualizing AI tools to the patient populations and clinical environments where they are deployed. The high sensitivity but reduced specificity of PANDORA in the synthetic outpatient setting reflects PUMA’s inherent design to prioritize case finding over false negatives, which is suitable for opportunistic screening but requires calibration in broader populations.

Furthermore, incorporating known COPD history into the risk assessment improved detection capabilities substantially, an example of how combining extracted data with clinical logic enhances model utility.

Deployment and Scalability

The modular architecture of PANDORA enables straightforward integration into hospital EHR systems or outpatient clinical software platforms. It can process clinical notes in real time or batch mode, allowing healthcare providers to rapidly surface key information and guideline-based recommendations.

Potential deployment barriers include variability in EHR documentation styles across institutions and countries, variable availability of critical features (e.g., smoking history often redacted in de-identified datasets), and the need for continuous human supervision to address LLM biases and errors.

However, PANDORA’s reliance on universally validated clinical scores like PUMA allows adaptability: by substituting or adding other disease-specific validated tools, the system could be expanded to screen or manage multiple conditions beyond COPD, including cardiovascular risk, diabetes, and infectious diseases.
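
Because scoring is decoupled from extraction, adding another validated instrument would, in principle, amount to registering the features it needs and a function that computes it. A minimal sketch of that idea, reusing `extract_features` and `puma_score` from the pipeline sketch above (score names and field lists are illustrative, not part of the published system):

```python
from typing import Callable

# Registry mapping a score name to the features it requires and its scorer.
SCORE_REGISTRY: dict[str, tuple[list[str], Callable[[dict], int]]] = {}

def register_score(name: str, required: list[str], scorer: Callable[[dict], int]) -> None:
    SCORE_REGISTRY[name] = (required, scorer)

# Plug in the PUMA scorer defined in the earlier sketch.
register_score("PUMA",
               ["sex", "age", "pack_years", "dyspnea", "phlegm", "cough", "prior_spirometry"],
               puma_score)

def run_score(name: str, note: str) -> int:
    required, scorer = SCORE_REGISTRY[name]
    features = extract_features(note)   # extraction agent from the sketch above
    missing = [k for k in required if k not in features]
    if missing:
        raise ValueError(f"Extraction did not return: {missing}")
    return scorer(features)
```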

Conclusion and Future Directions

PANDORA represents a pioneering step in applying large language models to automatically extract unstructured clinical data and apply validated clinical scores within one integrated system. Its outstanding performance in COPD risk identification highlights the promise of AI in enhancing early diagnosis and personalized decision-making without requiring laborious manual data curation.

Future work should focus on prospective validation in varied healthcare settings, refining specificity through threshold calibration, and expanding PANDORA’s scope to additional diseases and multi-language capabilities. With ongoing human oversight and model updates, such innovations hold great potential to streamline workflows, reduce diagnostic delays, and improve patient outcomes globally.

Reference: Jimenez D, Castano-Villegas N, Llano I, Martinez J, Ortiz L, Velasquez L, Zea J. PANDORA: An AI model for the automatic extraction of clinical unstructured data and clinical risk score implementation. 2025 IEEE Conference on Artificial Intelligence (CAI). DOI: 10.1109/CAI64502.2025.00280