Research and Analysis of Healthcare Data Breach Risk Prediction in the US Based on Interpretable Machine Learning

Mingyang Sun; Yuxin Wu; Rongtian Ye

doi:10.1051/itmconf/20268401018

Open Access

Issue		ITM Web Conf., Volume 84, 2026; 2026 International Conference on Advent Trends in Computational Intelligence and Data Science (ATCIDS 2026)


Article Number		01018
Number of page(s)		8
Section		Intelligent Computing in Healthcare and Bioinformatics
DOI		https://doi.org/10.1051/itmconf/20268401018
Published online		06 April 2026

ITM Web of Conferences 84, 01018 (2026)


Mingyang Sun¹^*, Yuxin Wu² and Rongtian Ye³

¹ School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China
² Information School, University of Washington, Seattle, Washington, The United States
³ College of Arts and Science, Syracuse University, Syracuse, New York, The United States

^* Corresponding author

Abstract

This study uses breach reports from the Office for Civil Rights at the Department of Health and Human Services (OCR-HHS) for 2019-2025 to build interpretable machine learning models that predict high-impact incidents (≥ 100,000 people affected) and the likelihood of ransomware. The input features include entity type, breach method, and the location of the compromised information. Logistic regression and random forest models are compared to balance transparency and accuracy, and are assessed with calibration analysis and feature importance analysis. The anticipated benefits are actionable risk scores for tiered controls, improved incident preparedness, and governance decision support for healthcare cybersecurity. The research relies only on publicly available aggregated statistics, not on patient-related data, which is consistent with professional and regulatory ethics. The findings indicate that both models outperform random baselines and can provide useful early-warning information; overall, however, their discriminative ability is low. The key features, attack vectors and information location, yield results that can inform operational security planning. Overall, the findings suggest that interpretable, data-driven predictions are feasible for reinforcing proactive cybersecurity governance in the healthcare industry.
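As a minimal sketch of how the high-impact label and the comparison against a random baseline could be expressed, the Python below derives the binary target from the ≥ 100,000 threshold stated in the abstract and scores a set of predicted probabilities. The abstract does not name its evaluation metrics, so the use of ROC AUC (where 0.5 corresponds to a random baseline) and the Brier score (as a calibration-oriented measure) is an assumption, and the breach sizes and predicted probabilities are hypothetical values invented for illustration only.

```python
from typing import List, Sequence

# Threshold taken from the abstract: high impact means >= 100,000 people affected.
HIGH_IMPACT_THRESHOLD = 100_000


def label_high_impact(individuals_affected: Sequence[int]) -> List[int]:
    """Return 1 for breaches at or above the threshold, else 0."""
    return [1 if n >= HIGH_IMPACT_THRESHOLD else 0 for n in individuals_affected]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC (assumed metric): the probability that a randomly chosen
    positive is scored above a randomly chosen negative; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def brier_score(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Brier score (assumed metric): mean squared error between the
    predicted probabilities and the 0/1 outcomes; lower is better."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical breach sizes and model outputs, not taken from the paper.
affected = [500, 250_000, 12_000, 100_000, 3_400, 1_200_000]
predicted = [0.10, 0.40, 0.30, 0.20, 0.05, 0.60]

labels = label_high_impact(affected)
auc = roc_auc(labels, predicted)
brier = brier_score(labels, predicted)

print(f"labels={labels} auc={auc:.3f} brier={brier:.3f}")
```

An AUC above 0.5 indicates ranking better than chance, and reporting a calibration-oriented score alongside it shows whether the predicted probabilities can be read as risk scores at all.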

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
