Privacy-preserving healthcare informatics: a review

The Electronic Health Record (EHR) is the key to an efficient healthcare service delivery system. The publication of healthcare data is highly beneficial to healthcare industries and government institutions, as it supports a variety of medical and census research. However, healthcare data contain sensitive patient information, and the publication of such data could lead to unintended privacy disclosures. In this paper, we present a comprehensive survey of state-of-the-art privacy-enhancing methods that ensure a secure healthcare data sharing environment. We focus on recently proposed schemes based on data anonymization and differential privacy for protecting healthcare data privacy. We highlight the strengths and limitations of the two approaches and discuss some promising future research directions in this area.


Introduction
EHR systems are increasingly adopted in the healthcare industry to collect and store patient data, which include sensitive information such as demographic data, medical history, diagnosis codes, medications, treatment plans, hospitalization records, insurance information, immunization dates, allergies, and laboratory and test results. The availability of such big data has provided unprecedented opportunities to improve the efficiency and quality of healthcare services, particularly by improving patient care outcomes and reducing medical costs. EHR data are published to enable the analyses required by healthcare industries [1] and government institutions [2][3]. Key examples include large-scale statistical analytics (e.g. the study of correlations between diseases), clinical decision making, treatment optimization, clustering (e.g. epidemic control) and census surveys. Driven by the potential of EHR systems, a number of EHR repositories have been established, such as the National Database for Autism Research (NDAR), UK Data Service, ClinicalTrials.gov and UNC Health Care (UNCHC).
Although the publication of EHR data is enormously beneficial, it could lead to unintended privacy disclosures. Many conventional security technologies, such as access control, authentication and encryption, have been deployed primarily to protect the security of EHR systems. However, these technologies do not guarantee privacy preservation. That is, an adversary could still infer sensitive patient information from the published data. Various policies and guidelines have been developed to restrict the types of publishable data and to set agreements on the usage and storage of data, for instance the US Health Insurance Portability and Accountability Act (HIPAA) [4][5], the EU General Data Protection Regulation (GDPR) [6][7] and the Personal Data Protection Act [8]. This approach has two limitations: i) It requires a high level of trust that the data recipient follows the rules and regulations provided by the data publisher. Yet, there are adversaries who attempt to attack the published data to reidentify a target victim. ii) Sensitive data might be carelessly published due to human error and fall into the wrong hands, which eventually leads to a breach of individual privacy. In short, policies and governmental acts do not provide a computational guarantee of patient privacy and thus cannot fully prevent such privacy violations. The need to protect individual data privacy in a hostile environment while still allowing accurate analysis of patient data has driven the development of effective privacy models for healthcare data.
In this paper, we present the privacy issues in healthcare data publication and elaborate on relevant adversarial attack models. We focus on data anonymization and differential privacy and discuss the limitations and strengths of the proposed approaches. Finally, we conclude the paper and highlight the future research direction in this area.

Privacy threats
In this section, we first discuss privacy-preserving data publishing (PPDP) and the properties of healthcare data. Then, we present the major privacy disclosures in healthcare data publication and show the relevant attack models. Finally, we present the privacy and utility objective in PPDP.

Privacy-preserving data publishing
Privacy-Preserving Data Publishing (PPDP) provides technical solutions that address the privacy and utility preservation challenges of data sharing scenarios. An overview of PPDP is shown in Fig. 1, which includes a general data collection and data publishing scenario. During the data collection phase, the data of the record owner (patient) are collected by the data holder (hospital) and stored as an EHR. In the data publishing phase, the data holder releases the collected data to the data recipient (e.g. the public or a third party such as the insurance industry or a medical center) for further analysis and data mining tasks. However, some data recipients (adversaries) are not honest and attempt to obtain more information about the record owner than the published data are meant to reveal, including the identity and sensitive data of the record owner. Hence, PPDP serves as a vital phase that sanitizes sensitive personal information to avoid privacy violations.

Healthcare data
Typically, healthcare data are relational data in tabular form. Each row (tuple) corresponds to one record owner and each column corresponds to an attribute. The attributes can be grouped into the following categories:
• Explicit identifier (ID): a set of attributes that uniquely identifies a record owner, such as name, social security number, national ID, mobile number and driving license number.
• Quasi-identifier (QID): a set of attributes that cannot uniquely identify a record owner on their own, but could identify the target when combined with some auxiliary information. Examples include date of birth, gender, address, zip code and hobby.
• Sensitive attribute (SA): sensitive personal information that the record owner intends to keep private from unauthorized parties. Examples include diagnosis code, genomic information, salary, health condition, insurance information and relationship status.
Attributes can be further divided into numerical attributes (e.g. age, zip code and date of birth) and non-numerical attributes (e.g. gender, job and disease). Table 1 shows an example dataset that is naively anonymized by removing the names and social security numbers of the patients.

Privacy disclosures
A privacy disclosure is the disclosure of personal information that a user intends to keep private to an entity that is not authorized to access or hold that information. There are three types of privacy disclosures:
• Identity disclosure: Also known as reidentification, this is the major privacy threat in publishing healthcare data. It occurs when an adversary reveals the true identity of a targeted victim from the published data. In other words, an individual is reidentified when an adversary is able to map a record in the published data to its corresponding patient with high probability (record linkage). For example, if an adversary knows that A is 43 years old, then A is reidentified as record 7 in Table 1.
• Attribute disclosure: It occurs when an adversary successfully links a victim to their SA information in the published data with high probability (attribute linkage). This SA information could be an SA value (e.g. disease in Table 1) or a range that contains the SA value (e.g. medical cost range).
• Membership disclosure: It occurs when an adversary successfully infers the existence of a targeted victim in the published data with high probability. For example, inferring that an individual is present in a Covid-19-positive database poses a privacy threat to that individual.
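Record linkage, as in the identity disclosure example above, can be sketched in a few lines. The records, attribute names and the `link` helper below are hypothetical and do not reproduce Table 1; they only illustrate how a single known QID value can single out one record and, with it, the sensitive attribute.

```python
# Hypothetical published records: explicit identifiers removed, QIDs kept.
published = [
    {"age": 43, "gender": "M", "zip": "47601", "disease": "flu"},
    {"age": 29, "gender": "F", "zip": "47602", "disease": "diabetes"},
    {"age": 29, "gender": "F", "zip": "47602", "disease": "asthma"},
]


def link(records, known):
    """Return the published records consistent with the adversary's knowledge."""
    return [r for r in records if all(r[a] == v for a, v in known.items())]


# The adversary only knows that the victim is 43 years old.
candidates = link(published, {"age": 43})
if len(candidates) == 1:
    # A unique match is an identity disclosure, and the disease follows from it.
    print("reidentified:", candidates[0])
```

When the adversary's knowledge matches several records instead of one, the linkage is ambiguous, which is the effect the anonymization models in the following sections aim for.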

Attack models
Privacy attacks could be launched by matching a published table containing sensitive information about the target victim with some external resources modelling the background knowledge of the attacker. For a successful attack, an adversary may require the following prior knowledge:
• The published
Generally, privacy attacks could be launched due to the linkability properties of the QID. Now, we discuss the relevant privacy attack models for identity and attribute disclosure.
• Table 3, the probability of having mental illness is 33.3%, which is much higher than that of the real distribution (11.1% in Table 1). This poses a privacy threat: anyone in the equivalence class has a 33.3% chance of being inferred to have mental illness, compared with 11.1% in the overall distribution.

Privacy and utility objective of PPDP
PPDP allows computational guarantees on the prevention of privacy disclosures while maintaining the usefulness of the published data. From the privacy aspect, the identity of patients and their corresponding SA values should be concealed from the public. For instance, it is permissible to disclose that there are diabetic patients in the hospital, but the published data should not disclose which patients have diabetes. Utility preservation is the other aspect of PPDP, which emphasizes publishing data that are "almost similar" to the original data. Let T be the original table, T′ the published table and M an arbitrary data mining process. The outputs M(T) and M(T′) should be almost similar: the difference between M(T) and M(T′) should be less than a threshold t. In most PPDP scenarios, the data mining process M (the usage of the published data) is unknown at the time of publication. This process M could be a simple census statistic or some specific analysis and data exploration, such as pattern mining, association rules and data modelling. Privacy and utility are two conflicting objectives: publishing high-utility data implies less privacy protection for the record owner, and vice versa.

Privacy models
In this section, we present some well-established privacy models that are used to ensure privacy in healthcare data. In particular, we focus on data anonymization and differential privacy, two mainstream PPDP technologies that differ in their data publishing mechanisms.
Fig. 2 shows a data publishing scenario in data anonymization. The original database is modified before being published as an anonymized database, which is generated by applying generalization and suppression to the original database. The anonymized database can then be studied in place of the original database. Some common data anonymization models to prevent privacy disclosure include k-anonymity [10][11][12][13][14], l-diversity [9], t-closeness [15] and δ-presence [16].
a) k-anonymity: k-anonymity was developed to address identity disclosure. It requires that, for every record in the table, there exist at least k-1 other records in the table that have the same QID value. Hence, each record is indistinguishable from at least k-1 other records with respect to the QID value in a k-anonymous table. For example, Tables 2 and 3 are 3-anonymous tables. Under k-anonymity, no individual can be reidentified from the published data with a probability higher than 1/k. Variations of k-anonymity include clustering anonymity [11], distribution-preserving k-anonymity [12], optimization-based k-anonymity [13], -sensitive-k-anonymity [14], (X,Y)-anonymity [17], (α, k)-anonymity [18], LKC-privacy [19] and random k-anonymous [20], which prevent identity disclosure by hiding the record of a target in an equivalence class of records with the same QID values. Although the k-anonymity model protects against identity disclosure, it is vulnerable to attribute disclosure: homogeneity attacks and background knowledge attacks can deduce the sensitive attribute values from the published data. To protect the sensitive attribute values, l-diversity and t-closeness were proposed.
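The k-anonymity requirement described above can be checked mechanically by grouping records on their QID values and inspecting the size of the smallest group. The following is a minimal sketch, assuming records are stored as dictionaries; the table contents, attribute names and the `is_k_anonymous` helper are illustrative and do not reproduce Tables 2 or 3.

```python
from collections import Counter


def is_k_anonymous(records, qid, k):
    """True if every combination of QID values occurs in at least k records."""
    groups = Counter(tuple(r[a] for a in qid) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical generalized table with two equivalence classes of three records.
table = [
    {"age": "20-29", "zip": "476**", "disease": "flu"},
    {"age": "20-29", "zip": "476**", "disease": "asthma"},
    {"age": "20-29", "zip": "476**", "disease": "diabetes"},
    {"age": "40-49", "zip": "479**", "disease": "cancer"},
    {"age": "40-49", "zip": "479**", "disease": "cancer"},
    {"age": "40-49", "zip": "479**", "disease": "cancer"},
]

print(is_k_anonymous(table, ["age", "zip"], 3))  # smallest class has 3 records
print(is_k_anonymous(table, ["age", "zip"], 4))  # no class has 4 records
```

Note that the second equivalence class in this hypothetical table is 3-anonymous yet every record in it carries the same disease, which is the homogeneity problem mentioned above.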

Data anonymization
b) l-diversity: l-diversity requires every QID group to contain at least l distinct sensitive attribute values. For example, Table 3 is a 3-diverse table, in which every QID group has at least 3 distinct sensitive attribute values. This method depends on the range of the sensitive attribute values. If the number of distinct sensitive attribute values is lower than the desired privacy parameter l, fictitious data must be added to achieve l-diversity. This leads to excessive modification and may produce biased results in statistical analysis. In addition, l-diversity does not prevent attribute disclosure when the overall distribution of the sensitive attribute is skewed: skewness attacks and similarity attacks can still disclose the SA values in an l-diverse table. k-anonymity and l-diversity were combined to propose -safe (l, k)-diversity [21].
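The distinct-value form of l-diversity described above can be tested in the same way as k-anonymity, by counting distinct SA values per QID group. This is a sketch under the same assumptions as before (dictionary records, hypothetical data and helper name), not the procedure of any cited scheme.

```python
from collections import defaultdict


def is_l_diverse(records, qid, sensitive, l):
    """True if every QID group holds at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in qid)].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())


# Hypothetical table: the second QID group is homogeneous (only "cancer").
homogeneous = [
    {"age": "20-29", "disease": "flu"},
    {"age": "20-29", "disease": "asthma"},
    {"age": "20-29", "disease": "diabetes"},
    {"age": "40-49", "disease": "cancer"},
    {"age": "40-49", "disease": "cancer"},
    {"age": "40-49", "disease": "cancer"},
]

# Same QID groups, but each group now holds three distinct diseases.
diverse = homogeneous[:3] + [
    {"age": "40-49", "disease": "cancer"},
    {"age": "40-49", "disease": "flu"},
    {"age": "40-49", "disease": "hepatitis"},
]

print(is_l_diverse(homogeneous, ["age"], "disease", 2))
print(is_l_diverse(diverse, ["age"], "disease", 3))
```

The homogeneous table fails even for l = 2, because knowing that a victim falls in the second group reveals the disease outright.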
c) t-closeness: To address these vulnerabilities, t-closeness was proposed. It requires the distribution of a sensitive attribute in any equivalence class to be close to the distribution of the attribute in the overall table. That is, the distance between the two distributions must be less than a threshold t. This property prevents an adversary from making an accurate estimate of the sensitive attribute values and thus prevents attribute disclosure. However, only SA values are modified in this model, while all the QID values remain unchanged. Hence, it does not prevent identity disclosure. Furthermore, t-closeness relies on a brute-force approach that examines each possible partition of the table to find the optimal solution, which takes time exponential in the size of the table.
d) δ-presence: To address membership disclosure, δ-presence was proposed to limit the confidence level of an adversary in inferring the existence of a targeted victim in the published data to at most δ%.
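The t-closeness condition can be illustrated by comparing the SA distribution of each equivalence class with the overall distribution. The sketch below is an assumption-laden simplification: it uses the variational distance (half the sum of absolute differences) as the distance measure, whereas the choice of distance in a real deployment depends on the attribute type. The table and function names are hypothetical; the data are chosen so that mental illness accounts for 1 of 9 records overall (11.1%) but 1 of 3 records (33.3%) in one class.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}


def variational_distance(p, q):
    """Half the L1 distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def max_class_distance(records, qid, sensitive):
    """Largest distance between any class distribution and the overall one."""
    overall = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[a] for a in qid)].append(r[sensitive])
    return max(variational_distance(distribution(v), overall)
               for v in groups.values())


def is_t_close(records, qid, sensitive, t):
    """True if every equivalence class is within distance t of the table."""
    return max_class_distance(records, qid, sensitive) <= t


# Hypothetical table: three equivalence classes of three records each.
diseases = {
    "20-29": ["mental illness", "flu", "cancer"],
    "30-39": ["flu", "flu", "cancer"],
    "40-49": ["flu", "cancer", "cancer"],
}
table = [{"age": age, "disease": d} for age, ds in diseases.items() for d in ds]

print(max_class_distance(table, ["age"], "disease"))
print(is_t_close(table, ["age"], "disease", 0.25))
```

A smaller threshold t forces class distributions closer to the overall distribution, at the cost of more aggressive grouping and therefore lower utility.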
There is a significant amount of precedent for different parameter values of the privacy models, which could be used as benchmarks for efficient data publishing. However, the choice of the privacy parameter value is flexible and depends on the desired privacy and utility objectives of the data publication, provided that "an acceptable privacy level" is guaranteed.
Fig. 3 shows a data publishing scenario in differential privacy. Differential privacy [22][23][24][25] involves a query answering process, in which a data recipient submits a query to the database and the result of that query is probabilistically indistinguishable regardless of the presence of any single record in the database. That is, given two databases that differ in exactly one record, a differentially private mechanism produces two randomized outputs that have almost the same probability distribution. In other words, an adversary cannot infer the existence of a targeted victim in the published database with high probability. Random noise drawn from a Laplace distribution is added to the result of the query to achieve privacy. This is a stronger privacy-enhancing technique that addresses the privacy vulnerabilities of data anonymization approaches, and it makes no assumption about the background knowledge of any potential adversary. However, it has some privacy and utility limitations. Firstly, the original data could be estimated with high accuracy from repeated queries. If an adversary repeats a differentially private query k times on a published database, then the original data could be disambiguated with high probability. Hence, Laplace noise must be injected k times to guarantee that the published data withstand k such queries. When k is large, the utility of the published data is degraded significantly. A differentially private database therefore answers at most q queries; this parameter q is called the privacy budget. The privacy of a database cannot be guaranteed if more than q queries are made to it. Thus, the database stops answering and provides no further data utility after q queries.
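The query answering process and the query limit q described above can be sketched as follows. This is a minimal illustration, not a production mechanism: the class name, the record layout and the parameter `epsilon` (the standard per-query privacy parameter, which the text above does not name) are assumptions, and the noise is sampled as the difference of two exponential variates, which follows a Laplace distribution.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class PrivateCounter:
    """Answers counting queries with Laplace noise until q queries are used."""

    def __init__(self, records, epsilon, q, seed=None):
        self.records = records
        self.epsilon = epsilon      # assumed per-query privacy parameter
        self.remaining = q          # number of queries still allowed
        self.rng = random.Random(seed)

    def count(self, predicate):
        if self.remaining <= 0:
            raise RuntimeError("query limit exhausted")
        self.remaining -= 1
        true_count = sum(1 for r in self.records if predicate(r))
        # Adding or removing one record changes a count by at most 1,
        # so the Laplace scale is 1 / epsilon.
        return true_count + laplace_noise(1.0 / self.epsilon, self.rng)


# Hypothetical database: 40 positive and 60 negative records.
records = [{"positive": True}] * 40 + [{"positive": False}] * 60
db = PrivateCounter(records, epsilon=1.0, q=3, seed=1)
answers = [db.count(lambda r: r["positive"]) for _ in range(3)]
print(answers)  # three noisy answers scattered around the true count of 40
```

If the limit q were very large, an adversary could average the repeated answers and the noise would cancel out, which is why the mechanism refuses further queries once the limit is reached.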

Differential privacy
Differential privacy preserves utility for low-sensitivity queries such as counting, range and predicate queries, because the presence or absence of a single record changes the result by at most one. However, a differentially private database could provide extremely inaccurate results for high-sensitivity queries, where a single record can change the result substantially. Examples of high-sensitivity queries include the computation of sums, maxima, minima, averages and correlations. Hence, a differentially private database is expected to provide highly biased results for more complex queries, such as the computation of variance, skewness and kurtosis.
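The contrast between low- and high-sensitivity queries can be made concrete by measuring how much a query result changes when one record is removed. The helper below computes this only for one given dataset, which is a data-dependent proxy; the Laplace mechanism is calibrated to the worst case over all neighbouring databases, with noise scale equal to the sensitivity divided by the privacy parameter ε. The cost values and the function name are hypothetical.

```python
def removal_sensitivity(query, records):
    """Largest change in the query result when any single record is removed."""
    full = query(records)
    return max(
        abs(full - query(records[:i] + records[i + 1:]))
        for i in range(len(records))
    )


# Hypothetical medical costs; one patient has an unusually large bill.
costs = [120, 80, 95, 10000]
epsilon = 0.5  # assumed privacy parameter

for name, query in [("count", len), ("sum", sum)]:
    sensitivity = removal_sensitivity(query, costs)
    # Laplace noise scale grows linearly with the sensitivity.
    print(name, "sensitivity:", sensitivity, "noise scale:", sensitivity / epsilon)
```

For the count, the noise scale stays small regardless of the data, whereas for the sum it is driven by the largest possible contribution of a single patient, which is why sum-like queries lose far more accuracy.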

Conclusion
Although healthcare data provide enormous opportunities to various domains, preserving privacy in healthcare data still poses several unsolved privacy and utility challenges. In this paper, we have provided a general overview of healthcare data publishing problems and discussed the state-of-the-art in data anonymization and differential privacy. We highlighted the practical strengths and limitations of these two privacy-enhancing technologies.
As a future research direction, it may be of interest to standardize privacy protection for privacy policy compliance. Healthcare data holders are required to comply with a number of privacy policies to protect the privacy of users. This may require the data holders to put systems and processes in place to maintain compliance. However, there is no clear indication of which privacy model and protection level should be adopted. In addition, what constitutes "an acceptable privacy level" is not explicitly or clearly defined in any current privacy law. Furthermore, it is of interest to design a privacy model that considers data publication in a distributed and dynamic environment, where multiple data holders publish their data independently to a data pool and the data may overlap. The problem is how to anonymize and analyze aggregated data that consist of anonymized data from each publisher. In addition, data are collected and published continuously in a dynamic EHR system (such as one fed by wearable healthcare devices). The information contained in profiles could be updated from time to time, and these updates must be reflected in the anonymized data.