AI-assisted mammography must move into a critical new phase of prospective clinical evaluation

To improve patient care, the study of artificial intelligence (AI) tools to support screening mammography must now shift from simulated research trials to robust clinical evaluations, according to Constance Lehman, MD, PhD, director of Breast Imaging at Massachusetts General Hospital (MGH).

In a commentary in JAMA Oncology, Lehman commended the authors of a recent study that reported on the performance of three available AI tools in a high-quality, modern, all-digital screening mammography database. She agreed that the time has come to progress beyond simulated research strategies and to carefully measure the performance of models in clinical practice.

“In the continued evolution of AI applied to improving human health, it is time to move beyond simulation and reader studies and enter the critical phase of rigorous, prospective clinical evaluation,” said Lehman in her commentary accompanying the study.

The head of breast imaging at MGH pointed out that early efforts to develop AI-based deep learning models to assist humans in mammographic interpretation have produced mixed results, including wide variations in quantity and quality of data used for model development, and variable methods to train, test, and internally and externally validate deep learning models.

“Work in both development and validation is needed in larger databases, including tomosynthesis examinations and diverse [commercial AI] vendors and patient populations,” Lehman explained. “But most importantly, rigorous studies to assess whether results from simulation studies will translate to success in routine clinical practice are now essential.”

Lehman credited the researchers [Salim and colleagues] for taking an important next step in the discovery process through their use of a large, curated screening mammography database to compare performance of the three commercial algorithms.

That methodology led to the finding that one of the three commercial models achieved a sensitivity of 81.9% with the specificity set at 96.6%. Against the U.S. Breast Cancer Surveillance Consortium benchmarks of 86.9% sensitivity and 88.9% specificity, those results compare favorably on specificity, though the model's sensitivity is lower than the benchmark.

In addition, Lehman found intriguing the authors’ “insights that challenge existing assumptions in the field.” She cited results suggesting that “the volume of cases may be more important than the diversity of vendors or patient populations in the databases used to develop the algorithm.”

Lehman went on to note that the highest-performing algorithm was developed from the largest dataset of screening mammograms: 72,000 cancer images and 680,000 normal images, compared with 6,000 cancer images and 106,000 normal images for the lowest-performing model.

In calling for additional studies, Lehman, who is also a professor of radiology at Harvard Medical School, said that prior failures with computer-aided detection (CAD) programs should serve as a cautionary lesson going forward.

“Although early reader and simulation studies of traditional CAD were encouraging, in the end, improved outcomes for patients receiving mammogram interpretations supported by CAD were not shown,” she pointed out. “Many studies have confirmed that humans respond differently to CAD assistance, and the same may be true for AI-assisted readings.”

The time has clearly come for a faster pace of research in this domain, coupled with “safe, careful and effective testing in prospective clinical trials,” Lehman concluded. “If AI models can be developed that can reliably detect on mammograms women with cancer from those without it, then quality, affordable screening mammography may finally become available to a large population of women globally who currently have no access to its life-saving potential.”