How much data is needed to create a great AI algorithm in healthcare

by Jose Zea · 2 min read

There are plenty of doubts when creating an algorithm in healthcare. One of the biggest today is: how large should my dataset be to build a strong predictive model?

The goal of your AI project should be to help physicians and healthcare personnel make crucial decisions about a patient's immediate treatment, creating a direct line of care that helps save lives. Therefore, every algorithm must start from a dataset containing information from a sample of patients that captures both the prediction target and the supporting variables (the same variables you will later use to make predictions).

The goal of your dataset is to give your algorithm a sufficiently high-quality, representative picture of your target population and the setting where you will use the model. Thus, quality (the variables it includes and their availability) is as important as quantity.

First, simulation studies conducted in the 1990s [1] [2] [3] suggest a common rule of thumb: your dataset should contain at least ten outcome events per predictor variable. In other words, if your model uses 20 variables, the dataset should include at least 200 patients who actually experienced the outcome, not merely 200 records overall. This rule of thumb has been widely promoted because of its simplicity, but it overlooks the fact that "variable" can be a misleading term depending on your algorithm's design.
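As a rough illustration of that arithmetic, here is a minimal Python sketch of the events-per-variable heuristic; the function name and the example numbers are placeholders of my own, and the rule itself is only a starting heuristic, not a guarantee of model quality.

```python
def min_outcome_events(n_predictors: int, events_per_variable: int = 10) -> int:
    """Minimum number of outcome events suggested by the events-per-variable heuristic."""
    return n_predictors * events_per_variable

# A model with 20 candidate predictor variables would need at least
# 200 patients who actually experienced the outcome.
print(min_outcome_events(20))  # 200
```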

Another common rule of thumb is to use as much data as possible. It is well known that larger datasets tend to produce more robust algorithms. But a large dataset whose records have more than 20% missing variables, contain many outliers, rely on supporting variables that are not available in your setting, or describe a cohort of patients that does not represent yours may be more harmful than training on a smaller, cleaner dataset.
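To make this kind of quality check concrete, the sketch below assumes a pandas DataFrame with one row per patient; the 20% threshold mirrors the text, but the function and column handling are illustrative, not a prescribed pipeline.

```python
import pandas as pd

def audit_quality(df: pd.DataFrame, max_missing_fraction: float = 0.20) -> pd.DataFrame:
    """Report per-record missingness and simple IQR-based outlier counts."""
    report = pd.DataFrame(index=df.index)
    # Fraction of variables missing in each record.
    report["missing_fraction"] = df.isna().mean(axis=1)
    report["too_many_missing"] = report["missing_fraction"] > max_missing_fraction

    # Count how many numeric values in each record fall outside 1.5 * IQR.
    numeric = df.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    report["n_outlier_values"] = outside.sum(axis=1)
    return report

# Example usage with a hypothetical patient table:
# quality = audit_quality(patients_df)
# usable = patients_df[~quality["too_many_missing"]]
```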

Finally, developing a predictive algorithm has context-specific needs; no single rule of thumb fits every case. Remember that your data should give the algorithm a good representation of your target population and the setting where you will use it. For example, suppose you want to predict whether a patient with Type II Diabetes (T2D) will develop Chronic Kidney Disease (CKD). You should size your dataset so that, at a minimum, it reflects the proportion of patients in your setting who develop the outcome (here, CKD) versus those who do not (here, those who remain with T2D only), using variables that are actually accessible to you. Thus, the number of records matters as much as the diversity of the population, the outcome proportion (incidence), the variables you choose, and the predictive performance you expect compared with current practice.
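To connect the outcome proportion to dataset size, here is a small sketch that combines an assumed incidence with the events-per-variable heuristic from above; the 15% CKD incidence and the predictor count are purely hypothetical placeholders, not figures from this article, and your own setting's incidence should replace them.

```python
import math

def min_cohort_size(n_predictors: int, outcome_incidence: float,
                    events_per_variable: int = 10) -> int:
    """Smallest cohort expected to yield enough outcome events for the heuristic."""
    required_events = n_predictors * events_per_variable
    return math.ceil(required_events / outcome_incidence)

# Hypothetical example: 20 predictors and an assumed 15% of T2D patients
# progressing to CKD in your setting -> roughly 1,334 patients needed.
print(min_cohort_size(n_predictors=20, outcome_incidence=0.15))
```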

In conclusion, use the largest high-quality dataset available for your task and setting. If you do not yet have that data, start by validating the clinical need, train a first version (V1) of your model, deploy it in a secure environment, collect data, and improve the model over time; this continuous learning is ultimately one of the greatest benefits of ML.

  1. Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol 1995;48:1503-10. doi:10.1016/0895-4356(95)00048-8. PMID: 8543964.
  2. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373-9. doi:10.1016/S0895-4356(96)00236-3. PMID: 8970487.

If you want to know more about Arkangel Ai, contact us here and one of our team members will reach out to schedule a one-on-one session.