Design of an Economic Model for Protectively Sharing Biomedical Data

Abstract
Sharing medical data (such as genomic data) can lead to important discoveries in healthcare, but research has shown that links between de-identified data and named persons are sometimes re-established by users with malicious intent. Traditional approaches to curbing this threat rely on data use agreements, suppression and noise addition to protect the privacy of individuals in the dataset, but these measures reduce the utility of the data. Therefore, this paper proposes the design of an economic game-theoretic model for quantifiable protection of genomic data. The model can be developed to find a solution for sharing summary statistics under an inference attack by an economically motivated recipient (adversary). The framework incorporates four main participants: Data Owners, Certified Institution (CI), Sharer and Researchers (Recipients). The data Sharer and Researcher (who are the players) are economically motivated.


Introduction
Using the massive amount of information encoded in biomedical data can have significant effects on personalized medicine, paternity testing and disease susceptibility testing. Data analysis can reveal vital information about an individual; for instance, disease susceptibility testing can determine whether an individual is likely to have a specific disease such as breast cancer or diabetes [1]. In personalized medicine, a physician can prescribe a safe and effective medical treatment based on the patient's genetic profile to minimize side effects. The increasing number of large biomedical databases and electronic health records is a vital resource for healthcare researchers. However, recent work has shown that sharing this data, even when aggregated into p-values, regression coefficients, count queries and Minor Allele Frequencies (MAFs), may compromise patient privacy [2].
The question is: can patient privacy be protected while still making the most of medical data? Extensive sharing and reuse of medical data are widely endorsed, but participants often expect their privacy to be preserved. To achieve privacy, many organizations are adopting legal protections, such as Data Use Agreements (DUAs) that explicitly prohibit re-identification (Paltoo et al.), and technical controls, such as suppressing or adding noise to genomic variants that have a high likelihood of distinguishing an individual [3]. However, reports have shown that preserving privacy might be impossible even when only summary statistics are shared [4]. There have also been reports over the past decade of de-identified genomic data being traced back to named persons, leading to public apologies and dramatic policy changes (Shringarpure and Bustamante). Various genomic statistics, such as Minor Allele Frequency (MAF) and regression coefficients, can lead to privacy concerns [5]. This understanding has led various groups to move statistical data from public databases into access-controlled formats. Though such protections help preserve privacy, they also restrict access to useful datasets for medical research. The medical community is at a crossroads: how can researchers access medical data to save lives while still ensuring the privacy of the individuals in the datasets?
An effective model for genomic data dissemination can be achieved through an approach based on game theory that accounts for adversarial behaviors and capabilities. This approach has already been used to analyze re-identification risk and has proven effective in some risk-inherent domains, such as airport security and coast guard patrols [6]. Methodologies are borrowed from game theory to develop effective, measurable protections for genomic data sharing. Because this method accounts for adversarial behavior, it balances risk against utility more effectively than traditional approaches.

Review of Related Works
Approaches to ensuring biomedical data privacy fall into two broad categories: non-cryptographic and cryptographic [7]. In this section, a brief summary of existing non-cryptographic techniques is presented.

Non-cryptographic approach
Non-cryptographic approaches adopt various sanitization techniques to ensure the privacy of genomic data. Privacy-Preserving Data Publishing (PPDP) is a well-studied domain and has been researched extensively for various types of data. These techniques study how to transform raw data into a version that is immunized against privacy attacks but still preserves useful information for data analysis. Existing techniques first sanitize raw data and then release the sanitized data for public use. Once the data is shared, the data owner has no further control over it. Existing techniques are primarily based on two major privacy models: k-anonymity and ε-differential privacy. Despite the wide applicability of k-anonymity in the healthcare domain, recent research results indicate that k-anonymity-based techniques are vulnerable to an adversary's background knowledge [8]. This has inspired a discussion in the research community in favor of the ε-differential privacy model, which provides provable privacy guarantees independent of an adversary's background knowledge. However, it is not well understood whether differential privacy is the right privacy model for biomedical data, as it fails to provide adequate data utility. Although many anonymization techniques have been proposed to satisfy a specific privacy model for various types of data (relational, set-valued and spatio-temporal), the problem of genomic data anonymization has been little studied.
One of the limitations of the non-cryptographic approach is the trade-off between privacy and utility. All the proposed methods sacrifice a significant amount of utility while protecting privacy. A differentially private mechanism may provide wrong information because of the added noise. Therefore, the cryptographic approach has recently received much attention as an alternative way to protect genomic data privacy. It was shown in [5] that de-identification is an ineffective way to protect the privacy of participants in genome-wide association studies. More recently, it has been shown how the DNA of several individuals (and their families) who participated in scientific studies was identified [9].
Several algorithms for inference on graphical models have been proposed in the context of pedigree analysis. Exact inference techniques on Bayesian networks are used to map disease genes and construct genetic maps. Monte Carlo methods (Gibbs sampling) have also proved efficient for genetic analyses in the case of complex pedigrees (Sheehan). All these methods aim to infer specific genotypes given phenotypes (such as diseases). Another paper relies on Gibbs sampling to infer haplotypes (used in association studies) from genotype data [10]. Genotype imputation is another technique used by geneticists to fill in missing SNPs based on given genotyped data. A similar approach has recently been used to infer high-density genotypes in pedigrees, relying notably on low-resolution genotypes and identity-by-descent regions of the genome [11]. None of these contributions addresses privacy. Johnson and Shmatikov proposed privacy-preserving algorithms for computing various statistics related to SNPs while guaranteeing differential privacy. However, differential privacy reduces the accuracy of research results and is intended to be applied to aggregate results. In our work, we focus on protecting individual genomic data. Some works also focus on protecting the privacy of genomic data while preserving utility in medical tests, such as comparing the similarity of DNA sequences [12].
[14] proposed privacy-preserving schemes for medical tests and personalized medicine methods that use patients' genomic data. For privacy-preserving clinical genomics, a group of researchers proposed outsourcing some costly computations to a public cloud or semi-trusted service provider.

Preamble
This framework is designed to increase access to large-scale genomic data while promoting privacy using a game-theoretic approach. Game models are built to formalize the interactions among data owners, and a backward induction approach is proposed to find the Nash equilibria of the game. A model for genomic data dissemination and sharing is designed to account for adversarial behavior and capabilities. The model is used to demonstrate how a game-theoretic approach can improve data sharing. Game theory is adapted to model the interactions involved in the genomic data sharing process in order to protect privacy. The decision of a data sharer is affected by several factors, including personal privacy preference (e.g., how much the sharer cares about privacy), the incentives offered by the data collector, and the level of privacy protection that the data collector guarantees.
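As a hedged illustration of what backward induction means for a two-stage leader-follower game of this kind, the solution concept can be written as below. The symbols are assumptions introduced here for illustration and are not taken from the paper: $S$ is the sharer's set of policies, $A$ the recipient's set of actions, and $u_S$, $u_R$ their respective payoff functions.

```latex
% Stage 2: for any policy s announced by the sharer, the recipient best-responds.
a^{*}(s) \in \arg\max_{a \in A} \; u_R(s, a)

% Stage 1: anticipating that response, the sharer picks the policy
% that maximizes its own payoff.
s^{*} \in \arg\max_{s \in S} \; u_S\bigl(s, a^{*}(s)\bigr)
```

Under these assumptions, the pair $(s^{*}, a^{*}(s^{*}))$ is the outcome obtained by solving the follower's problem first and then the leader's.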

System Design Overview
Two actors play the game: the SNP sharer, who could be an investigator of a study or an organization such as an academic medical center, and the recipient (or researcher), who requests access to the data for some purpose (for example, research or the discovery of new associations).
The majority of recipients are unlikely to misuse the data, but the privacy concern lies with those who have the potential to exploit named genomes (or targets) by determining their presence in the dataset. In this model, the sharer is a leader who can (a) require a DUA with liquidated damages in the event of a breach of contract and (b) share a subset of SNP summary statistics from a specific study (suppressing the rest).
The recipient of the data then follows by determining whether the benefits gained by attacking each target outweigh the costs. Crucially, the sharer chooses the policy that optimally balances the anticipated utility and privacy risk. Figure 1 presents a general architecture of the proposed framework. The game is played between the SNP sharer and the researcher. As depicted in Figure 1, Data Owners (DOs) are the institutions that have agreed to share the genomic data (that is, the study dataset (SD)) they possess. These institutions might be academic institutions, non-academic research organizations, government research agencies or health departments. They collect samples from donors, carry out DNA sequencing and provide the data to the CI in the previously agreed digitized format. CIs are the institutions that have the legal authority to process the raw dataset from DOs before sending it to the SNP sharer. The data shared by different data owners reside in a database owned by the trusted entity. Any government institution, such as the National Institutes of Health (NIH) in the United States, or the Nigerian Institute of Medical Research (NIMR) and the African Center of Excellence for Genomics of Infectious Diseases (ACEGID) in Nigeria, can play this role. The main responsibilities of the CI are SNP Filtering, MAF Computation and Privacy Protection (PP). PP is done by checking the data's resistance to one of the strongest statistical re-identification attacks (the likelihood ratio test).
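The likelihood ratio check mentioned above could be sketched as follows. This is a minimal illustration and not the CI's actual procedure: it assumes a binary allele encoding per SNP (1 if the minor allele is present, 0 otherwise), a reference-population MAF vector, and a hypothetical decision threshold. The statistic is a commonly used log-likelihood-ratio form for membership inference on allele frequencies, and all function and parameter names are invented for this example.

```python
import math


def lr_statistic(genotype, pool_maf, ref_maf, eps=1e-9):
    """Log-likelihood ratio that `genotype` comes from the study pool
    rather than from the reference population (assumed binary encoding)."""
    total = 0.0
    for x, p_pool, p_ref in zip(genotype, pool_maf, ref_maf):
        # Clamp frequencies away from 0 and 1 so the logarithms stay finite.
        p_pool = min(max(p_pool, eps), 1 - eps)
        p_ref = min(max(p_ref, eps), 1 - eps)
        if x == 1:
            total += math.log(p_pool / p_ref)
        else:
            total += math.log((1 - p_pool) / (1 - p_ref))
    return total


def resists_lr_attack(targets, pool_maf, ref_maf, threshold):
    """True if no target genotype scores at or above the (hypothetical)
    threshold, i.e. the attacker would not flag any target as a member."""
    return all(lr_statistic(g, pool_maf, ref_maf) < threshold for g in targets)
```

A higher statistic means the released MAFs look more like the target's genotype than the reference frequencies do. In practice the threshold would be calibrated to an acceptable false-positive rate, which this sketch leaves to the caller.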

SNP Sharer (Publisher)
SNP Sharers (also known as publishers) are biomedical researchers who disseminate research datasets. Funding organizations, such as the NIH, NIHR and ACEGID, require researchers who are granted funding to publish the data generated by their research through websites such as the Database of Genotypes and Phenotypes (dbGaP). However, while sharers need to share data, they also have an incentive to protect the identities of the individuals who participated in the original research (that is, to ensure the privacy of the data), whereas the recipient is interested in the utility of the data. The benefit associated with publishing the research dataset can be correlated with the amount of funding received for the project. For example, consider the dataset in dbGaP submitted by the five separate member institutions of the NIH-sponsored Electronic Medical Records and Genomics Network (EMERGE) [15].

Researchers (Recipient)
Researchers might be any individuals or organizations interested in executing queries on the aggregate shared data residing in the CS. To execute a query on the outsourced data, researchers need to log in with a registered password, which is stored in the CI database.
The recipient is modeled as an intelligent attacker who can access external resources (called the target set) at a fixed cost to perform a re-identification attack, and who attempts re-identification only if the associated benefits exceed the costs (which can also include linking and curation costs) [16,17].
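The leader-follower interaction between sharer and recipient can be illustrated with a small enumeration. This is a toy sketch under stated assumptions, not the paper's model: the SNP names, utilities, per-SNP risk contributions, attack benefit and cost, DUA penalty and privacy loss are all hypothetical numbers, and the payoff functions are simplified forms chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy inputs (not taken from the paper).
SNP_UTILITY = {"rs1": 4.0, "rs2": 3.0, "rs3": 1.0}  # research value of each SNP
SNP_RISK = {"rs1": 0.5, "rs2": 0.2, "rs3": 0.1}     # contribution to attack success
ATTACK_BENEFIT = 10.0  # recipient's gain from a successful re-identification
ATTACK_COST = 2.5      # fixed cost of accessing the target set
DUA_PENALTY = 4.0      # liquidated damages if a DUA is in force
PRIVACY_LOSS = 8.0     # sharer's loss per successful re-identification


def success_prob(shared):
    """Assumed attack success probability given the shared SNP subset."""
    return min(1.0, sum(SNP_RISK[s] for s in shared))


def recipient_attacks(shared, dua):
    """Follower: attack only if the expected benefit exceeds the cost."""
    penalty = DUA_PENALTY if dua else 0.0
    return success_prob(shared) * (ATTACK_BENEFIT - penalty) - ATTACK_COST > 0


def sharer_payoff(shared, dua):
    """Leader: utility of shared SNPs minus expected privacy loss if attacked."""
    payoff = sum(SNP_UTILITY[s] for s in shared)
    if recipient_attacks(shared, dua):
        payoff -= success_prob(shared) * PRIVACY_LOSS
    return payoff


def best_policy():
    """Enumerate every (SNP subset, DUA flag) policy and return the one
    that maximizes the sharer's payoff given the recipient's best response."""
    snps = sorted(SNP_UTILITY)
    policies = [
        (frozenset(subset), dua)
        for r in range(len(snps) + 1)
        for subset in combinations(snps, r)
        for dua in (False, True)
    ]
    return max(policies, key=lambda policy: sharer_payoff(*policy))


if __name__ == "__main__":
    shared, dua = best_policy()
    print(sorted(shared), dua, sharer_payoff(shared, dua))
```

With these toy numbers, requiring a DUA lets the sharer release rs2 and rs3 without triggering an attack, whereas without a DUA the same subset would be attacked. Exhaustive enumeration is only feasible for a handful of SNPs; a realistic SNP set would need a heuristic or optimization-based search.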

Conclusion
In this paper, a privacy-preserving technique for biomedical data using game theory has been proposed. The main contribution to data privacy research is the design of a model for representing major parties' interactions (Data Owners, Certified Institution, SNP Sharer and Researchers (Recipients)) involved in genomic data sharing. This is to ensure that sharing and dissemination activities are captured in order to protect privacy.