Learn how natural language processing in healthcare is transforming medical research and how Pandora optimizes unstructured data mining.
According to an analysis by Health Catalyst, Natural Language Processing (NLP) has enabled researchers and physicians to convert large volumes of unstructured text into useful and accessible data. This type of text, which includes electronic health records, clinical notes, and lab reports, constitutes about 80% of patient data. With the use of NLP, it is possible to extract valuable information for clinical decision-making and predictive analytics (Health Catalyst). Among the most prominent innovations in this field is the creation of Pandora, developed by Arkangel AI, a language model specifically designed for the extraction and structuring of medical data.
In medicine, much of the relevant information is found in medical records, discharge notes, and other plain text documents. This complicates data analysis and slows down evidence-based decision-making. Manual processing of this information is slow, costly, and prone to errors. With NLP, it's possible to automate the analysis of complex texts and generate structured data that can be used in clinical practice and research.
Pandora is a generative AI model that processes natural language, facilitating the extraction and structuring of information from unstructured sources. It was specifically designed to overcome the challenges inherent in handling large volumes of unlabeled and difficult-to-access medical data. The model is equipped with two key algorithms that work together to retrieve information and to offer recommendations based on the scale or clinical guideline that the researcher chooses to apply to the extracted information.
Pandora's operation has two main phases:
To validate its effectiveness, Pandora was tested using two main data sources: the MIMIC-IV-Note database, which collects anonymized medical notes, and a set of 100 synthetic clinical histories generated by AI from a guide with 9 hypothetical clinical cases, set in the context of an outpatient consultation and following the guidelines of the Colombian Ministry of Health. We applied human evaluation to each of the cases to assess the model's data extraction capabilities, its application of a risk scale, and its recommendation generation.
For extraction, we decided to use the PUMA scale, validated in several Latin American countries for risk assessment and case finding in Chronic Obstructive Pulmonary Disease (COPD).
The recommendation regarding COPD risk, based on the PUMA score, had a sensitivity of 100% in synthetic cases and 89% in the real MIMIC cases. However, the specificity was less than 80% for both.
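As a reminder of how the two metrics reported here are computed, the following is a minimal sketch in Python. The function name and the toy labels are hypothetical and chosen only for illustration; they are not Pandora's evaluation code or the study's data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    actual: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    actual[i] is True when case i truly has the condition;
    predicted[i] is True when the tool flagged case i as at risk.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")

    sensitivity = tp / (tp + fn)  # share of true cases that were flagged
    specificity = tn / (tn + fp)  # share of non-cases that were not flagged
    return sensitivity, specificity


# Made-up toy labels, not results from the study.
toy_actual = [True, True, True, False, False, False, False]
toy_predicted = [True, True, True, True, True, False, False]

toy_sens, toy_spec = sensitivity_specificity(toy_actual, toy_predicted)
print(f"sensitivity={toy_sens:.2f} specificity={toy_spec:.2f}")
```

In this toy example every true case is flagged, so sensitivity is perfect, while half of the non-cases are also flagged, which pulls specificity down. That is the same qualitative pattern described for the PUMA-based recommendation.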
During validation, Pandora showed a good capacity for extracting data and applying the PUMA calculator in both MIMIC and synthetic cases. The low specificity reflects the fact that the PUMA calculator is a very sensitive but not very specific case-finding tool. Because all cases used to evaluate the model (MIMIC and synthetic) involve patients with cardiorespiratory diseases, the calculator classified many of the differential diagnoses as COPD.
This result suggests that the PUMA calculator may not be entirely adequate for the population used or that using a higher cut-off point may provide better results, which can be explored in a subsequent study. Despite this, Pandora proved to be an effective tool for extracting clinical data and implementing a PUMA risk calculator, with potential to be adapted to different clinical scenarios requiring different risk measurements.
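The idea that a higher cut-off point could trade some sensitivity for better specificity can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the scores, the labels, and the candidate cut-offs are invented, and the helper names are hypothetical rather than part of Pandora or of the PUMA instrument.

```python
from typing import Dict, Sequence, Tuple


def rates_at_cutoff(
    scores: Sequence[int], has_condition: Sequence[bool], cutoff: int
) -> Tuple[float, float]:
    """Flag a case when score >= cutoff; return (sensitivity, specificity)."""
    flagged = [s >= cutoff for s in scores]
    tp = sum(1 for f, c in zip(flagged, has_condition) if f and c)
    fn = sum(1 for f, c in zip(flagged, has_condition) if not f and c)
    tn = sum(1 for f, c in zip(flagged, has_condition) if not f and not c)
    fp = sum(1 for f, c in zip(flagged, has_condition) if f and not c)
    return tp / (tp + fn), tn / (tn + fp)


# Invented risk scores and diagnoses, not data from the study.
example_scores = [7, 6, 8, 5, 5, 6, 3, 4]
example_has_copd = [True, True, True, False, False, False, False, False]

# Hypothetical cut-offs to compare.
sweep: Dict[int, Tuple[float, float]] = {
    cutoff: rates_at_cutoff(example_scores, example_has_copd, cutoff)
    for cutoff in (5, 6, 7)
}

for cutoff, (sens, spec) in sweep.items():
    print(f"cutoff>={cutoff}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

On this invented data, raising the cut-off by one point improves specificity without losing any true case, while raising it further starts to miss cases. Whether a similar sweet spot exists for the population studied here is exactly what a follow-up study would need to establish.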
The development of Pandora marks a milestone in the use of natural language processing in health. As this technology continues to evolve, it is expected to be applied to a greater variety of diseases and clinical scales, leveraging its modular architecture and capacity for continuous improvement. Furthermore, future research could expand the validation of Pandora using real-world data, optimizing its accuracy and specificity in different medical contexts.
Read our paper at: https://app.hubspot.com/documents/8676854/view/881078300?accessId=193a29