Understanding Biased Medical Data
The adage "garbage in, garbage out" is often used to describe the limitations of computer systems, but when it comes to biased medical data, the saying is too blunt. A recent opinion piece in the New England Journal of Medicine (NEJM) by professors from MIT, Johns Hopkins University, and the Alan Turing Institute argues that biased clinical data should not simply be discarded, and that a more nuanced approach is needed to address biased AI models in medical settings.
The Problem with Biased Data
When researchers encounter biased data, the typical response is to collect more data from underrepresented groups, or to generate synthetic data, so that a model performs equally well across patient populations. The authors argue, however, that this purely technical approach should be augmented with a sociotechnical perspective that accounts for historical and current social factors. Doing so, they contend, makes efforts to address bias in public health more effective.
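To make the contrast concrete, here is a minimal sketch of the kind of purely technical fix described above: reweighting training examples so an underrepresented group contributes equally to the model's loss. The dataset, group labels, and model choice are hypothetical stand-ins, not anything from the NEJM piece.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cohort: features X, binary outcomes y, and a group label
# indicating which patients come from an underrepresented population.
rng = np.random.default_rng(0)
n = 1000
group = rng.choice(["majority", "underrepresented"], size=n, p=[0.9, 0.1])
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# The standard technical fix: reweight so the underrepresented group
# contributes as much total weight to the loss as the majority group.
minority = group == "underrepresented"
weights = np.where(minority, (~minority).sum() / minority.sum(), 1.0)

model = LogisticRegression()
model.fit(X, y, sample_weight=weights)
```

The authors' point is that a fix like this balances the arithmetic of the training set without ever asking why the group was underrepresented in the first place.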
Data as Artifact
The authors suggest viewing biased clinical data as "artifacts" that reveal the practices, belief systems, and cultural values behind existing inequities in the healthcare system. For example, a 2019 study showed that an algorithm using healthcare expenditures as an indicator of need erroneously concluded that sicker Black patients required the same level of care as healthier white patients: because less money had historically been spent on Black patients' care, equal spending masked unequal illness. This highlights the need to consider the social and historical elements that influence how data is collected and used in clinical AI development.
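As a rough illustration of the proxy-label mechanism behind that finding, the toy simulation below assumes two groups with identical true need but systematically suppressed spending for one of them; the numbers, including the 0.6 suppression factor, are invented for illustration only.

```python
import numpy as np

# Toy model of the proxy-label problem: true need is identically
# distributed across groups, but historical spending on group B is
# systematically lower, so ranking patients by cost understates
# group B's need.
rng = np.random.default_rng(1)
need = rng.gamma(shape=2.0, scale=1.0, size=1000)   # true illness burden
group_b = rng.random(1000) < 0.5                    # hypothetical group label
cost = need * np.where(group_b, 0.6, 1.0)           # unequal access suppresses spending

# Select the 100 "highest-need" patients, by truth and by the cost proxy.
top_by_need = np.argsort(need)[-100:]
top_by_cost = np.argsort(cost)[-100:]
print("Share of group B among the truly sickest:   ", group_b[top_by_need].mean())
print("Share of group B selected by the cost proxy:", group_b[top_by_cost].mean())
```

Ranking by the cost proxy under-selects the suppressed group even though true need is identical, mirroring the pattern the 2019 study documented at scale.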
The Importance of Context
The authors emphasize the importance of context when working with biased datasets. For instance, the biases present in a dataset of lung cancer patients collected at a hospital in Uganda may differ from those in a dataset collected in the U.S. for the same patient population. By taking local context into account, algorithms can be trained to better serve the specific populations they will be deployed on.
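One simple way to act on this advice is to evaluate a model at each site rather than assume performance transfers. The sketch below, with entirely synthetic "site A" and "site B" data, shows how a model fit in one context can degrade in another where different features drive the same outcome; the sites, features, and coefficients are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

def make_site(coefs, n=500):
    """Generate synthetic patients whose outcome is driven by `coefs`."""
    X = rng.normal(size=(n, 4))
    p = 1 / (1 + np.exp(-(X @ coefs)))
    y = (rng.random(n) < p).astype(int)
    return X, y

# Two hypothetical sites where different features drive the same outcome.
X_a, y_a = make_site(np.array([1.5, 0.5, 0.0, 0.0]))  # "site A"
X_b, y_b = make_site(np.array([0.2, 0.5, 1.5, 0.0]))  # "site B"

# Fit at site A, then check whether performance transfers to site B.
model = LogisticRegression().fit(X_a, y_a)
print("AUROC at the training site:", round(roc_auc_score(y_a, model.predict_proba(X_a)[:, 1]), 3))
print("AUROC at the new site:     ", round(roc_auc_score(y_b, model.predict_proba(X_b)[:, 1]), 3))
```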
When More Data Can Harm Performance
The authors also note that including personalized attributes like self-reported race in clinical risk scores can actually lead to worse risk scores, models, and metrics for minority and minoritized populations. This underscores the need for a nuanced approach to bias in medical data, one that weighs the complex social and historical factors shaping healthcare outcomes.
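The practical upshot is to audit per-group performance with and without such an attribute, rather than assuming more features always help. The sketch below shows that audit procedure on synthetic data; the group labels, features, and numbers are hypothetical, and it demonstrates the comparison itself, not the harm the authors describe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical audit: held-out, per-group AUROC for models trained
# with and without a self-reported race feature.
rng = np.random.default_rng(3)
n = 4000
race = rng.integers(0, 2, size=n)   # 0/1 stand-in for self-reported race
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=n) > 0).astype(int)
X_with = np.column_stack([X, race])

for label, feats in [("without race", X), ("with race", X_with)]:
    F_tr, F_te, y_tr, y_te, r_tr, r_te = train_test_split(
        feats, y, race, test_size=0.5, random_state=0)
    scores = LogisticRegression().fit(F_tr, y_tr).predict_proba(F_te)[:, 1]
    for g in (0, 1):
        mask = r_te == g
        print(f"{label:13s} group {g} AUROC: {roc_auc_score(y_te[mask], scores[mask]):.3f}")
```

A real audit would use clinically meaningful metrics and populations, but the structure is the same: never report a single aggregate number when subgroup performance is the question.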
Moving Forward
The National Institutes of Health (NIH) has prioritized the collection of high-quality, ethically sourced datasets to enable the use of next-generation AI technologies in healthcare. The NIH’s $130 million Bridge2AI Program aims to drive ethical practices in data collection and use. By prioritizing context, considering biased datasets as artifacts, and taking a nuanced approach to addressing bias, researchers can develop safe and effective clinical AI models that improve healthcare outcomes for all populations.
Expert Insights
Elaine Nsoesie, an associate professor at the Boston University School of Public Health, believes that treating biased datasets as artifacts rather than garbage has many benefits, including a focus on context and the identification of discriminatory practices. Marzyeh Ghassemi, a co-author of the NEJM piece, notes that people should be more concerned about the current state of healthcare than about the potential risks of AI. Acknowledging the problems with existing healthcare systems, she argues, is the first step toward creating a more equitable and just one.
Conclusion
Addressing biased medical data requires a nuanced approach that considers social and historical factors, local context, and the complex relationships among data, algorithms, and healthcare outcomes. By treating biased datasets as artifacts and prioritizing ethical data collection and use, researchers can develop clinical AI models that improve outcomes for all populations and help build a more equitable and just healthcare system.