Privacy-preserving data mining


Introduction

Preserving privacy is a concern in nearly every informatics discipline, including but not limited to bioinformatics, homeland security, and financial analysis. With the recent development of information collection and dissemination technologies, it has become a significant issue for cybersecurity. The explosion of new information available through the Internet and other media has inaugurated a new era of research in which data-mining algorithms must be considered from the viewpoint of privacy preservation, called privacy-preserving data mining (PPDM). The ubiquitous application of data-mining and machine-learning algorithms allows malicious users to employ data mining to obtain private information, which raises two questions: will data mining compromise privacy, and should data mining be limited in some applications? This concern can be addressed from two aspects: ethical and technological. Legitimate use of private data benefits both the data-mining users and the data owners. PPDM reduces unauthorized access to private information while retaining the same functions as a normal data-mining method for discovering useful knowledge. Privacy-preserving methods generally alter the data, so that commonly employed data-mining methods cannot discover the same knowledge from the modified data as completely and correctly as from the original data.

Defining privacy preservation in data mining

Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed even to the party running the algorithm. The main consideration of PPDM is twofold [1]. First, sensitive raw data like identifiers, names, addresses and so on should be modified or trimmed out from the original database, so that the recipient of the data is not able to compromise another person's privacy. Second, sensitive knowledge which can be mined from a database by using data mining algorithms should also be excluded, because such knowledge can equally well compromise data privacy. So, privacy preservation occurs in two major dimensions: users' personal information and information concerning their collective activity. The former is referred to as individual privacy preservation and the latter as collective privacy preservation.

Individual privacy preservation

The primary goal of data privacy is the protection of personally identifiable information. In general, information is considered personally identifiable if it can be linked, directly or indirectly, to an individual person. Thus, when personal data are subjected to mining, the attribute values associated with individuals are private and must be protected from disclosure. Miners are then able to learn from global models rather than from the characteristics of a particular individual.

Collective privacy preservation

Protecting personal data may not be enough. Sometimes, we may need to protect against learning sensitive knowledge representing the activities of a group. We refer to the protection of sensitive knowledge as collective privacy preservation. The goal here is quite similar to that of statistical databases, in which security control mechanisms provide aggregate information about groups while preventing the disclosure of confidential information about individuals. However, unlike the case of statistical databases, another objective of collective privacy preservation is to preserve strategic patterns that are paramount for strategic decisions, rather than to minimize the distortion of all statistics. In other words, the goal here is not only to protect personally identifiable information but also some patterns and trends that are not supposed to be discovered.

Data distribution

In PPDM, the data available for mining may be centralized or distributed across many sites. With distributed data, the way the data is distributed also plays an important role in defining the problem. Different partitionings pose different problems and can lead to different algorithms for privacy-preserving data mining. Distributed data scenarios can be classified into horizontal data distribution and vertical data distribution [1]. Horizontal distribution refers to those cases where different database records reside in different places, while vertical data distribution refers to the cases where the values of different attributes reside in different places.
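
As a simple illustration of the two scenarios (the record layout and field names below are hypothetical), the following Python sketch splits the same table horizontally, so that each site holds a subset of the records, and vertically, so that each site holds a subset of the attributes for every record:

# Hypothetical example: partitioning a small table of records two ways.
records = [
    {"id": 1, "age": 34, "zip": "10001", "diagnosis": "A"},
    {"id": 2, "age": 29, "zip": "10002", "diagnosis": "B"},
    {"id": 3, "age": 41, "zip": "10001", "diagnosis": "A"},
    {"id": 4, "age": 57, "zip": "10003", "diagnosis": "C"},
]

# Horizontal partitioning: different records reside at different sites.
site1_rows = records[:2]
site2_rows = records[2:]

# Vertical partitioning: different attributes of every record reside at
# different sites, linked by a common identifier.
site1_cols = [{"id": r["id"], "age": r["age"], "zip": r["zip"]} for r in records]
site2_cols = [{"id": r["id"], "diagnosis": r["diagnosis"]} for r in records]

print(site1_rows, site2_rows, site1_cols, site2_cols, sep="\n")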

Models of PPDM

In the study of privacy-preserving data mining (PPDM), there are mainly four models as follows:

  • Trusted Third Party Model

The gold standard for security is the assumption that we have a trusted third party to whom we can give all data. The third party performs the computation and delivers only the results; except for the third party, it is clear that nobody learns anything not inferable from its own input and the results. The goal of secure protocols is to reach this same level of privacy preservation without the problem of finding a third party that everyone trusts.

  • Semi-honest Model

In the semi-honest model, every party follows the rules of the protocol using its correct input, but is free to use whatever it sees during execution of the protocol to try to compromise security afterwards.

  • Malicious Model

In the malicious model, no restrictions are placed on any of the participants. Thus any party is completely free to indulge in whatever actions it pleases. In general, it is quite difficult to develop efficient protocols that are still valid under the malicious model. However, the semi-honest model does not provide sufficient protection for many applications.

  • Other Models - Incentive Compatibility

While the semi-honest and malicious models have been well researched in the cryptographic community, other models outside the purview of cryptography are possible. One example is the interesting economic notion of incentive compatibility. A protocol is incentive compatible if it can be shown that a cheating party is either caught or else suffers an economic loss. Under the rational model of economics, this serves to ensure that parties gain no advantage by cheating. Of course, in an irrational model, this would not work. We remark that in the "real world" there is no external party that can be trusted by all participants, so the Trusted Third Party Model is an idealized model.

Evaluation of privacy preserving algorithms

An important aspect in the development and assessment of algorithms and tools for privacy preserving data mining is the identification of suitable evaluation criteria and the development of related benchmarks. It is often the case that no privacy preserving algorithm exists that outperforms all the others on all possible criteria. Rather, an algorithm may perform better than another one on specific criteria, such as performance and/or data utility. It is thus important to provide users with a set of metrics which will enable them to select the most appropriate privacy preserving technique for the data at hand, with respect to the specific parameters they are interested in optimizing. A preliminary list of evaluation parameters to be used for assessing the quality of privacy preserving data mining algorithms is given below:

  • Privacy level offered by a privacy preserving technique, which indicates how closely the sensitive information that has been hidden can still be estimated;

  • Hiding failure, that is, the portion of sensitive information that is not hidden by the application of a privacy preservation technique;
  • Data quality after the application of a privacy preserving technique, considered both as the quality of data themselves and the quality of the data mining results after the hiding strategy is applied;
  • Complexity, that is, the ability of a privacy preserving algorithm to execute with good performance in terms of all the resources implied by the algorithm.

Metrics for Quantifying Privacy Level

Additive-Noise-based Perturbation Techniques

The basic idea of additive-noise-based perturbation techniques is to add random noise to the actual data. In [2], Agrawal and Srikant use an additive-noise-based technique to perturb data. They then estimate the probability distribution of the original numeric data values in order to build a decision tree classifier from the perturbed training data. They introduce a quantitative measure to evaluate the amount of privacy offered by a method and evaluate the proposed method against this measure. Privacy is measured by evaluating how closely the original values of a modified attribute can be determined. In particular, if the original value of a perturbed attribute can be estimated, with confidence c, to belong to an interval [a, b], then the privacy is estimated by (b − a) with confidence c. However, this metric does not work well because it does not take into account the distribution of the original data along with the perturbed data. Therefore, a metric that considers all the informative content of the data available to the user is needed. Agrawal and Aggarwal [3] address this problem by introducing a new privacy metric based on the concept of information entropy. More specifically, they propose an Expectation Maximization (EM) based algorithm for distribution reconstruction, which converges to the maximum likelihood estimate of the original distribution given the perturbed data. Their measurement of privacy takes into account the fact that the perturbed individual records, the reconstructed distribution, and the perturbing distribution are all available to the user. This metric defines the average conditional privacy of an attribute A, given other information modeled by a random variable B, as:

$$\Pi(A|B) = 2^{h(A|B)}$$

where h(A|B) is the conditional differential entropy of A given B, representing a measure of the uncertainty inherent in the value of A given the value of B. Finally, we introduce a universal measure of the data privacy level, proposed by Bertino et al. in [4]. The basic concept behind this measure is information entropy, which is defined as follows: let X be a random variable which takes on a finite set of values according to a probability distribution p(x). Then the entropy of this probability distribution is defined as:

$$h(X) = -\sum_{x} p(x) \log_2 p(x)$$

or, in the continuous case:

$$h(X) = -\int f(x) \log_2 f(x)\, dx$$

where f(x) denotes the density function of the continuous random variable x. Information entropy is a measure of how much "choice" is involved in the selection of an event, or of how uncertain we are of its outcome. It can be used to quantify the amount of information associated with a set of data. The concept of "information associated with data" is useful in evaluating the privacy achieved by a PPDM algorithm. Because the entropy represents the information content of a datum, the entropy after data sanitization should be higher than the entropy before the sanitization. Moreover, the entropy can be viewed as measuring how uncertain the forecast of an event is, which in our context is the forecast of the correct value of a datum. Consequently, the level of privacy inherent in an attribute X, given some information modeled by Y, is defined as follows:

$$\Pi(X|Y) = 2^{h(X|Y)} \qquad (3)$$

The privacy level defined in equation (3) is very general. In order to use it in different PPDM contexts, it needs to be refined in relation to characteristics such as the type of transactions, the type of aggregation and the PPDM method. In [4], an example of instantiating the entropy concept to evaluate the privacy level in the context of association rules is presented. However, it is worth noting that the value of the privacy level depends not only on the PPDM algorithm used, but also on the knowledge that an attacker has about the data before the use of data mining techniques, and on the relevance of this knowledge in the data reconstruction operation. This problem is underlined, for example, in [5]. In [4], this aspect is not considered, but it is possible to introduce assumptions about attacker knowledge by properly modeling Y.
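
To make the additive-noise approach and the interval-based privacy measure concrete, here is a minimal Python sketch under simple assumptions (uniform noise, synthetic data); it illustrates the idea rather than the exact procedure of [2]:

import numpy as np

rng = np.random.default_rng(0)

# Original sensitive numeric attribute (hypothetical salaries).
x = rng.normal(loc=50_000, scale=10_000, size=1_000)

# Additive-noise perturbation: add noise drawn uniformly from [-w/2, w/2].
w = 20_000
y = x + rng.uniform(-w / 2, w / 2, size=x.size)

# Interval-based privacy measure (in the style of [2]): with confidence c,
# the original value lies within an interval of width c * w around the
# perturbed value, so the privacy at confidence c is c * w.
c = 0.95
print("privacy at 95% confidence:", c * w)

# The data miner can still reconstruct aggregate statistics from the perturbed data.
print("original mean:", x.mean(), "perturbed mean:", y.mean())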

Multiplicative-Noise-based Perturbation Techniques

According to [6], additive random noise can be filtered out using certain signal processing techniques with very high accuracy. To avoid this problem, random-projection-based multiplicative perturbation techniques have been proposed in [7]. Instead of adding random values to the actual data, random matrices are used to project the set of original data points onto a randomly chosen lower-dimensional space. However, the transformed data still preserve many statistical aggregates of the original dataset, so that certain data mining tasks (e.g., computing the inner product matrix, linear classification, k-means clustering and computing Euclidean distances) can be performed on the transformed data in a distributed environment (data either vertically or horizontally partitioned) with small errors. In addition, this approach provides a high degree of privacy for the original data. As analyzed in the paper, even if the random matrix (i.e., the multiplicative noise) is disclosed, it is impossible to find the exact values of the original dataset, although finding an approximation of the original data is possible. The variance of the approximated data is used as the privacy measure. Oliveira and Zaiane [8] also adopt a multiplicative-noise-based perturbation technique to perform clustering analysis while at the same time ensuring privacy preservation. They introduce a family of geometric data transformation methods in which a noise vector is applied to distort confidential numerical attributes. The privacy ensured by such techniques is measured as the variance of the difference between the actual and the perturbed values. This measure is given by Var(X − Y), where X represents a single original attribute and Y the distorted attribute. This measure can be made scale invariant with respect to the variance of X by expressing security as Sec = Var(X − Y)/Var(X).
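
A minimal sketch of the random-projection idea, illustrative rather than the exact construction of [7]: the data are multiplied by a random matrix that maps them into a lower-dimensional space, and pairwise Euclidean distances are approximately preserved.

import numpy as np

rng = np.random.default_rng(1)

n, d, k = 200, 50, 20          # records, original dimension, projected dimension
X = rng.normal(size=(n, d))    # original (sensitive) data

# Random projection matrix; scaling by 1/sqrt(k) makes squared distances
# approximately preserved in expectation.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R                      # perturbed data released for mining

# Compare a few pairwise distances before and after projection.
for i, j in [(0, 1), (2, 3), (4, 5)]:
    d_orig = np.linalg.norm(X[i] - X[j])
    d_proj = np.linalg.norm(Y[i] - Y[j])
    print(f"pair ({i},{j}): original {d_orig:.2f}, projected {d_proj:.2f}")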

K-Anonymization Techniques

The concept of k-anonymization was introduced by Samarati and Sweeney in [9]. A database is k-anonymous with respect to quasi-identifier attributes (a set of attributes that can be used with certain external information to identify a specific individual) if there exist at least k transactions in the database having the same values on the quasi-identifier attributes. In practice, in order to protect a sensitive dataset T, before releasing T to the public it is converted into a new dataset T∗ that guarantees the k-anonymity property for a sensitive attribute by performing value generalizations on the quasi-identifier attributes. Therefore, the degree of uncertainty of the sensitive attribute is at least 1/k.
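
As a small illustration (the table, attribute names, and helper function below are hypothetical, not from [9]), the following sketch checks whether a released table satisfies k-anonymity by grouping records on their quasi-identifier values:

from collections import Counter

def is_k_anonymous(table, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in table)
    return all(count >= k for count in groups.values())

# Generalized release: ages bucketed, ZIP codes truncated (hypothetical data).
released = [
    {"age": "30-39", "zip": "100**", "disease": "flu"},
    {"age": "30-39", "zip": "100**", "disease": "cold"},
    {"age": "40-49", "zip": "101**", "disease": "flu"},
    {"age": "40-49", "zip": "101**", "disease": "asthma"},
]

print(is_k_anonymous(released, ["age", "zip"], k=2))   # True
print(is_k_anonymous(released, ["age", "zip"], k=3))   # False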

Differentially private mechanisms

Differential privacy aims to maximize the accuracy of queries to statistical databases while minimizing the chances of identifying individual records. Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these mechanisms, like the Laplace mechanism described below, rely on adding controlled noise. Differential privacy is a definition, not an algorithm: for a given computational task T and a given value of ε there will be many ε-differentially private algorithms for achieving T, some with better accuracy than others. When ε is small, finding a highly accurate ε-differentially private algorithm for T can be difficult, much as finding a numerically stable algorithm for a specific computational task can require effort.

The Laplace mechanism

Many differentially private methods add controlled noise to functions with low sensitivity. The Laplace mechanism adds Laplace noise, i.e. noise from the Laplace distribution, which can be expressed by the probability density function noise(y) ∝ exp(−|y|/λ) and which has mean zero and standard deviation √2·λ. In our case we define the output of the algorithm A as a real-valued function (called the transcript output by A)

$$T_A(x) = f(x) + Y,$$

where Y ∼ Lap(λ) and f is the original real-valued query/function we planned to execute on the database. Clearly T_A(x) can be considered a continuous random variable, for which

$$\frac{\mathrm{pdf}(T_{A,D_1}(x) = t)}{\mathrm{pdf}(T_{A,D_2}(x) = t)} = \frac{\mathrm{noise}(t - f(D_1))}{\mathrm{noise}(t - f(D_2))},$$

which is at most e^{|f(D_1) − f(D_2)|/λ} ≤ e^{Δ(f)/λ}, where D_1 and D_2 are databases differing in a single record and Δ(f) is the sensitivity of f. We can consider Δ(f)/λ to be the privacy factor ε; thus T_A follows an ε-differentially private mechanism. For example, for a counting query, whose sensitivity is Δ(f) = 1, obtaining an ε-differentially private algorithm A requires λ = 1/ε. Although we have used Laplace noise here, other forms of noise, such as Gaussian noise, also allow the construction of differentially private mechanisms.
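
A minimal sketch of the Laplace mechanism for a counting query (assumed sensitivity 1; an illustration, not a production implementation):

import numpy as np

rng = np.random.default_rng(2)

def laplace_count(data, predicate, epsilon):
    """Return a differentially private count of records satisfying predicate.

    A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 29, 41, 57, 23, 62, 45]
print(laplace_count(ages, lambda a: a >= 40, epsilon=0.5))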

Cryptography-based Techniques

Cryptography-based techniques usually guarantee a very high level of data privacy. In [10], Kantarcioglu and Clifton address the problem of secure mining of association rules over horizontally partitioned data, using cryptographic techniques to minimize the information shared. Their solution is based on the assumption that each party first encrypts its own itemsets using commutative encryption, and then encrypts the already encrypted itemsets of every other party. Later on, an initiating party transmits its frequency count, plus a random value, to its neighbor, which adds its own frequency count and passes it on to the other parties. Finally, a secure comparison takes place between the final and initiating parties to determine whether the final result is greater than the threshold plus the random value.

Computing association rules without disclosing individual transactions is straightforward. We can compute the global support and confidence of an association rule AB⇒C knowing only the local supports of AB and ABC, and the size of each database:

$$\mathrm{support}_{AB \Rightarrow C} = \frac{\sum_{i=1}^{s} \mathrm{support\_count}_{ABC}(i)}{\sum_{i=1}^{s} |DB_i|}, \qquad \mathrm{confidence}_{AB \Rightarrow C} = \frac{\sum_{i=1}^{s} \mathrm{support\_count}_{ABC}(i)}{\sum_{i=1}^{s} \mathrm{support\_count}_{AB}(i)}$$

where s is the number of sites, DB_i is the database at site i, and support_count_X(i) is the local support count of itemset X at site i.
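
The following sketch computes the global support and confidence from per-site counts (the numbers are hypothetical, and the cryptographic protections of [10] are omitted; it only illustrates the arithmetic above):

# Per-site local counts for itemsets AB and ABC (hypothetical).
sites = [
    {"count_AB": 120, "count_ABC": 80, "size": 1000},
    {"count_AB": 200, "count_ABC": 150, "size": 2500},
    {"count_AB": 90,  "count_ABC": 40, "size": 700},
]

total_AB = sum(s["count_AB"] for s in sites)
total_ABC = sum(s["count_ABC"] for s in sites)
total_size = sum(s["size"] for s in sites)

support = total_ABC / total_size          # global support of AB => C
confidence = total_ABC / total_AB         # global confidence of AB => C
print(f"support={support:.3f}, confidence={confidence:.3f}")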


Another cryptography-based approach is described in [11]. This approach addresses the problem of association rule mining in vertically partitioned data. In other words, its aim is to determine item frequencies when transactions are split across different sites, without revealing the contents of individual transactions. The security of the protocol for computing the scalar product is analyzed. Although cryptography-based techniques can protect data privacy well, they may not be considered good with respect to other metrics, such as efficiency, that will be discussed in later sections.

Metrics for Quantifying Hiding Failure

The percentage of sensitive information that is still discovered after the data has been sanitized gives an estimate of the hiding failure parameter. Most of the developed privacy preserving algorithms are designed with the goal of obtaining zero hiding failure; thus, they hide all the patterns considered sensitive. However, it is well known that the more sensitive information we hide, the more non-sensitive information we miss. Thus, some PPDM algorithms have recently been developed which allow one to choose the amount of sensitive data that should be hidden, in order to find a balance between privacy and knowledge discovery. For example, in [12], Oliveira and Zaiane define the hiding failure (HF) as the percentage of restrictive patterns that are still discovered from the sanitized database. It is measured as follows:

$$HF = \frac{\#R_P(D')}{\#R_P(D)}$$

where #R_P(D) and #R_P(D′) denote the number of restrictive patterns discovered from the original database D and the sanitized database D′, respectively. Ideally, HF should be 0. In their framework, they introduce a disclosure threshold φ, representing the percentage of sensitive transactions that are not sanitized, which allows one to find a balance between the hiding failure and the number of misses. Note that φ does not control the hiding failure directly, but indirectly, by controlling the proportion of sensitive transactions to be sanitized for each restrictive pattern. Moreover, it is important not to forget that intruders and data terrorists will try to compromise information by using various data mining algorithms. Therefore, a PPDM algorithm developed against a particular data mining technique that assures privacy of information may not attain similar protection against all possible data mining algorithms. In order to provide a complete evaluation of a PPDM algorithm, we need to measure its hiding failure against data mining techniques different from the technique for which the PPDM algorithm was designed. The evaluation requires the consideration of a class of data mining algorithms which are significant for our test. Alternatively, a formal framework could be developed such that, by testing a PPDM algorithm against pre-selected data sets, privacy assurance can be proved transitively for a whole class of data mining algorithms.
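
A small sketch of the hiding-failure computation; the pattern sets are hypothetical and would in practice come from a frequent-pattern miner run on D and D′:

def hiding_failure(restrictive_patterns, patterns_in_original, patterns_in_sanitized):
    """Fraction of restrictive patterns still discoverable after sanitization."""
    found_before = restrictive_patterns & patterns_in_original
    found_after = restrictive_patterns & patterns_in_sanitized
    return len(found_after) / len(found_before) if found_before else 0.0

restrictive = {("beer", "diapers"), ("aspirin", "heart_drug")}
mined_original = {("beer", "diapers"), ("aspirin", "heart_drug"), ("bread", "milk")}
mined_sanitized = {("aspirin", "heart_drug"), ("bread", "milk")}

print(hiding_failure(restrictive, mined_original, mined_sanitized))  # 0.5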

Metrics for Quantifying Data Quality

The main feature of most PPDM algorithms is that they usually modify the database through the insertion of false information or through the blocking of data values in order to hide sensitive information. Such perturbation techniques cause a decrease in data quality. It is obvious that the more changes are made to the database, the less the database reflects the domain of interest. Therefore, data quality metrics are very important in the evaluation of PPDM techniques. Since data are often sold for profit, or shared with others in the hope of leading to innovation, data quality should remain at an acceptable level, also with respect to the intended data usage. If data quality is too degraded, the released database is useless for the purpose of knowledge extraction. Several data quality metrics have been proposed in existing work that are either generic or data-use-specific. However, currently there is no metric that is widely accepted by the research community. Here we try to identify a set of possible measures that can be used to evaluate different aspects of data quality. In evaluating the data quality after the privacy preserving process, it can be useful to assess both the quality of the data resulting from the PPDM process and the quality of the data mining results. The quality of the data themselves can be considered as a general measure evaluating the state of the individual items contained in the database after the enforcement of a privacy preserving technique. The quality of the data mining results evaluates the alteration in the information that is extracted from the database after the privacy preservation process, on the basis of the intended data use.

Quality of the Data Resulting from the PPDM Process

The main problem with data quality is that its evaluation is relative, in that it usually depends on the context in which data are used. In particular, some aspects of data quality evaluation are strongly related not only to the PPDM algorithm, but also to the structure of the database and to the meaning and relevance of the information stored in the database with respect to a well-defined context. In the scientific literature, data quality is generally considered a multi-dimensional concept that in certain contexts involves both objective and subjective parameters. Among the various possible parameters, the following are usually considered the most relevant:

  • Accuracy: it measures the proximity of a sanitized value to the original value.
  • Completeness: it evaluates the degree of missed data in the sanitized database.
  • Consistency: it is related to the internal constraints, that is, the relationships that must hold among different fields of a data item or among data items in a database.

Accuracy

Accuracy is closely related to the information loss resulting from the hiding strategy: the lower the information loss, the better the data quality. This measure largely depends on the specific class of PPDM algorithms. In what follows, we discuss how different approaches measure accuracy. As for heuristic-based techniques, we distinguish the following cases based on the modification technique used for the hiding process. If the algorithm adopts a perturbation or a blocking technique to hide both raw and aggregated data, the information loss can be measured in terms of the dissimilarity between the original dataset D and the sanitized one D′. In [12], Oliveira and Zaiane propose three different methods to measure the dissimilarity between the original and sanitized databases. The first method is based on the difference between the frequency histograms of the original and the sanitized databases. The second method is based on computing the difference between the sizes of the sanitized database and the original one. The third method is based on a comparison between the contents of the two databases. A more detailed analysis of the definition of dissimilarity is presented by Bertino et al. in [4]. They suggest using the following formula in the case of transactional dataset perturbation:

$$\mathrm{Diss}(D, D') = \frac{\sum_{i=1}^{n} \left| f_D(i) - f_{D'}(i) \right|}{\sum_{i=1}^{n} f_D(i)}$$

where i is a data item in the original database D and f_D(i) is its frequency within that database, whereas f_{D′}(i) is its frequency within the transformed database D′ after the application of a privacy preservation technique. As we can see, the information loss is defined as the ratio between the sum of the absolute errors made in computing the frequencies of the items from a sanitized database and the sum of all the frequencies of items in the original database. This formula can also be used for the PPDM algorithms which adopt a blocking technique for inserting into the dataset uncertainty about some sensitive data items or their correlations. The frequency of an item i belonging to the sanitized dataset D′ is then given by the mean value between its minimum frequency, computed by considering all the blocking values associated with it equal to zero, and its maximum frequency, obtained by considering all the question marks equal to one.
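
A short sketch of this dissimilarity measure computed over item-frequency counts (hypothetical transaction data):

from collections import Counter

def dissimilarity(original_transactions, sanitized_transactions):
    """Ratio of absolute item-frequency errors to total item frequency in the original."""
    f_orig = Counter(item for t in original_transactions for item in t)
    f_san = Counter(item for t in sanitized_transactions for item in t)
    num = sum(abs(f_orig[i] - f_san.get(i, 0)) for i in f_orig)
    den = sum(f_orig.values())
    return num / den

D = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]
D_prime = [{"a", "c"}, {"a", "c"}, {"c"}, {"a", "b"}]   # after sanitization
print(dissimilarity(D, D_prime))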

The next metric, the classification metric (CM), is introduced to optimize a k-anonymous dataset for training a classifier. It is defined as the sum of the individual penalties for each row in the table, normalized by the total number of rows N:

$$CM = \frac{\sum_{i=1}^{N} \mathrm{penalty}(\mathrm{row}_i)}{N}$$

The penalty value of row r is 1, i.e., row r is penalized, if it is suppressed or if its class label is not the majority class label of its group; otherwise, the penalty value of row r is 0. This metric is particularly useful when we want to build a classifier over anonymized data. Another interesting metric is the discernibility metric (DM). The discernibility metric assigns a penalty to each tuple based on how many tuples in the transformed dataset are indistinguishable from it. Let t be a tuple from the original table T, and let G_{T′}(t) be the set of tuples in the anonymized table T′ indistinguishable from t, that is, the set of tuples in T′ equivalent to the anonymized value of t. Then DM is defined as follows:

$$DM = \sum_{t \in T} \left| G_{T'}(t) \right|$$

Note that if a tuple t has been suppressed, the size of G_{T′}(t) is the same as the size of T′. In many situations, suppression is considered the most expensive operation in the sense of information loss. Thus, to maximize data utility, tuple suppression should be avoided whenever possible.
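
A sketch of the discernibility metric for a small anonymized table (hypothetical data; suppressed rows, marked here with '*', are penalized with the full table size):

from collections import Counter

def discernibility_metric(anonymized_rows, quasi_identifiers):
    """Sum, over rows, of the size of the equivalence class each row falls into.

    Suppressed rows (all quasi-identifier values set to '*') are penalized
    with the size of the whole table.
    """
    n = len(anonymized_rows)
    keys = [tuple(r[a] for a in quasi_identifiers) for r in anonymized_rows]
    sizes = Counter(keys)
    total = 0
    for key in keys:
        if all(v == "*" for v in key):      # suppressed tuple
            total += n
        else:
            total += sizes[key]
    return total

rows = [
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
    {"age": "*", "zip": "*"},               # suppressed
    {"age": "40-49", "zip": "101**"},
]
print(discernibility_metric(rows, ["age", "zip"]))  # 2 + 2 + 4 + 1 = 9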

Completeness and Consistency

While accuracy is a relatively general parameter, in that it can be measured without strong assumptions on the analyzed dataset, completeness is not so general. For example, in some PPDM strategies, e.g. blocking, the completeness evaluation is not significant. On the other hand, consistency requires determining all the relationships that are relevant for a given dataset. In [13], Bertino et al. propose a set of evaluation parameters including completeness and consistency evaluation. Unlike other techniques, their approach takes into account two more important aspects: the relevance of the data and the structure of the database. They provide a formal description that can be used to specify the aggregate information of interest for a target database, and the relevance of the data quality properties of each piece of aggregate information and of each attribute involved in it. Specifically, the completeness lack (denoted as CML) is measured as follows:

Q7.png

In this equation, DMG is an oriented graph in which each node Ni is an attribute class, CV is the completeness value and CW is the associated weight. The consistency lack (denoted as CSL) is given by the number of constraint violations occurring in all the sanitized transactions, multiplied by the weight associated with each constraint.

Q8.png

In this equation, csv indicates the number of violations, cw is the weight of the constraint, SCi denotes a simple constraint class, and CCj denotes a complex constraint class.


The Sanitization Algorithms

Sanitizing algorithms [12] for transactional databases can be classified into two classes: algorithms that solely remove information from the transactional database, and algorithms that modify existing information. The first class only reduces the support of some items, while the second may also increase the support of some items.

The Naive Algorithm

The main idea behind the Naive Algorithm is to select all items in a given restrictive pattern as victims. The rationale behind this selection is that by removing the items of a restrictive pattern from the sensitive transactions, the pattern will be hidden. If a sensitive transaction contains exactly the same items as a restrictive pattern, the Naive Algorithm removes all items of this transaction except for the item with the highest frequency in the database. Because one item must be kept, the number of transactions is not modified. Selecting the sensitive transactions to sanitize is based simply on their degree of conflict. Given the number of sensitive transactions to alter, determined by the disclosure threshold φ, this approach selects for each restrictive pattern the transactions with the smallest degree of conflict. The rationale is, as above, to minimize the impact of the sanitization on the discovery of the legitimate patterns. The sketch of the Naive Algorithm is given as follows:

Input: D, R_P, φ

Output: D′

Step 1. For each restrictive pattern rp_i ∈ R_P do

1. T[rp_i] ← Find_Sensitive_Transactions(rp_i, D);

Step 2. For each restrictive pattern rp_i ∈ R_P do

1. Victim_{rp_i} ← item_v such that item_v ∈ rp_i // every item of rp_i is chosen as a victim

Step 3. For each restrictive pattern rp_i ∈ R_P do

1. NumTrans_{rp_i} ← |T[rp_i]| × (1 − φ) // the number of sensitive transactions to sanitize for rp_i

Step 4. D′ ← D

For each restrictive pattern rp_i ∈ R_P do

1. Sort_Transactions(T[rp_i]); // in ascending order of degree of conflict

2. TransToSanitize ← Select first NumTrans_{rp_i} transactions from T[rp_i]

3. In D′ foreach transaction t ∈ TransToSanitize do

3.1. t ← (t − Victim_{rp_i})

End
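
A minimal Python rendering of this sanitization scheme (illustrative; the helper names and the representation of transactions as item sets are assumptions of this sketch, not notation from [12]):

def find_sensitive_transactions(pattern, database):
    """Indices of transactions that contain every item of the restrictive pattern."""
    return [i for i, t in enumerate(database) if pattern <= t]

def naive_sanitize(database, restrictive_patterns, phi, item_frequency):
    """Remove the items of each restrictive pattern from (1 - phi) of its sensitive
    transactions; if a transaction equals the pattern, keep its most frequent item."""
    D_prime = [set(t) for t in database]
    for rp in restrictive_patterns:
        sensitive = find_sensitive_transactions(rp, D_prime)
        num_to_sanitize = round(len(sensitive) * (1 - phi))
        # In the full algorithm `sensitive` would be sorted by degree of conflict;
        # that step is omitted here for brevity.
        for i in sensitive[:num_to_sanitize]:
            if D_prime[i] == rp:
                keep = max(rp, key=item_frequency.get)      # keep highest-frequency item
                D_prime[i] = {keep}
            else:
                D_prime[i] = D_prime[i] - rp
    return D_prime

D = [{"a", "b", "c"}, {"a", "b"}, {"b", "c", "d"}, {"a", "b", "d"}]
freq = {"a": 3, "b": 4, "c": 2, "d": 2}
print(naive_sanitize(D, [frozenset({"a", "b"})], phi=0.0, item_frequency=freq))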

The Minimum Frequency Item Algorithm

The main idea behind the Minimum Frequency Item Algorithm, denoted MinFIA, is to select as the victim item, for a given restrictive pattern, the pattern's item with the smallest support. The rationale behind this selection is that removing the item with the smallest support from the sensitive transactions will have the smallest impact on the database and on the legitimate patterns to be discovered. Selecting the sensitive transactions to sanitize is based simply on their degree of conflict. Given the number of sensitive transactions to alter, determined by the disclosure threshold φ, this approach selects for each restrictive pattern the transactions with the smallest degree of conflict. The rationale is, as above, to minimize the impact of the sanitization on the discovery of the legitimate patterns. The sketch of the Minimum Frequency Item Algorithm is given as follows:

Input: D, R_P, φ

Output: D′

Step 1. For each restrictive pattern rp_i ∈ R_P do

1. T[rp_i] ← Find_Sensitive_Transactions(rp_i, D);

Step 2. For each restrictive pattern rp_i ∈ R_P do

1. Victim_{rp_i} ← item_v such that item_v ∈ rp_i and item_v has the smallest support among the items of rp_i

Step 3. For each restrictive pattern rp_i ∈ R_P do

1. NumTrans_{rp_i} ← |T[rp_i]| × (1 − φ) // the number of sensitive transactions to sanitize for rp_i

Step 4. D′ ← D

For each restrictive pattern rp_i ∈ R_P do

1. Sort_Transactions(T[rp_i]); // in ascending order of degree of conflict

2. TransToSanitize ← Select first NumTrans_{rp_i} transactions from T[rp_i]

3. In D′ foreach transaction t ∈ TransToSanitize do

3.1. t ← (t − Victim_{rp_i})

End
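
The only change relative to the naive sketch above is the victim selection; a hedged illustration of MinFIA's Step 2, using the same hypothetical representation as before:

def minfia_victim(pattern, item_frequency):
    """MinFIA victim: the single item of the restrictive pattern with the smallest support."""
    return min(pattern, key=item_frequency.get)

freq = {"a": 3, "b": 4, "c": 2, "d": 2}
print(minfia_victim(frozenset({"a", "b", "c"}), freq))   # 'c'

Each selected sensitive transaction then has only this single victim item removed, which is what limits the impact on the legitimate patterns.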

Privacy Preserving Clustering [14]

Algorithm (k-means clustering)

begin initialize n, c, μ_1, μ_2, ..., μ_c

do classify the n samples according to the nearest μ_j, and recompute μ_j

until no change in the μ_j's

return μ_1, μ_2, ..., μ_c

end

A sample x_i is deemed to be in cluster j if it is closest to the mean μ_j, where the mean of a cluster C_j is μ_j = (1/|C_j|) Σ_{x ∈ C_j} x. The distance between two m-dimensional vectors x and y is given by d(x, y) = Σ_{j=1}^{m} (x[j] − y[j])², where x[j] is the j-th element of the vector x.
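
For reference, a compact, non-private k-means sketch in Python corresponding to the pseudocode above (synthetic data; this is not the protocol of [14]):

import numpy as np

def kmeans(X, c, iters=100, seed=0):
    """Plain (non-private) k-means: returns the c cluster means."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        # Classify each sample according to the nearest mean.
        labels = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2), axis=1)
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(c)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, size=(30, 2)) for m in (0.0, 3.0)])
print(kmeans(X, c=2))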

Privacy-preserving k-means

In order to create a privacy-preserving version of k-means that does not use a trusted third party (TTP), we have to devise a privacy-preserving protocol to compute the cluster means. Consider the computation of a single cluster mean μ_j. Recall that in distributed k-means each party sends the sum of its points assigned to cluster j and the number of such points to the TTP, which computes μ_j by dividing the total sum by the total count; this is precisely the function for which we have to devise a privacy-preserving protocol. The problem can be formally defined as follows: in the weighted average problem (WAP), party 1 has a pair (x, n), where x is a real number and n is a positive integer. Similarly, party 2 has a pair (y, m). They want to jointly compute (x + y)/(n + m). In other words, we need a privacy-preserving protocol for the following functionality:

$$((x, n), (y, m)) \mapsto \left(\frac{x+y}{n+m},\ \frac{x+y}{n+m}\right)$$

The notation above means that the first and second parties provide inputs (x, n) and (y, m) to the protocol, and both parties receive the output (x + y)/(n + m). Notice that WAP is different from the classical problem of computing averages, where n parties each have a number and jointly want to compute the average without revealing their individual numbers. In the classical problem, the number of parties n is known to all the parties. In WAP, the numbers of points n and m need to be kept secret.

Privacy-Preserving Protocol for the Weighted Average Problem

Protocol based on oblivious polynomial evaluation [14]

We will first give a privacy-preserving protocol for a general problem, and then at the end of the subsection demonstrate how we can construct a privacy-preserving protocol for WAP. Consider the following problem.

Let F be a finite field. Party 1 has two polynomials P and Q with coefficients in F. Party 2 has two points α and β in F. Both parties want to compute P(α)/Q(β). In other words, we want to privately compute the following functionality:

$$((P, Q), (\alpha, \beta)) \mapsto \left(\frac{P(\alpha)}{Q(\beta)},\ \frac{P(\alpha)}{Q(\beta)}\right)$$

We call this problem private rational polynomial evaluation (PRPE).

Let F be a finite field. The oblivious polynomial evaluation (OPE) problem can be defined as follows: Alice A has a polynomial P over the finite field F, and Bob B has an element x ∈ F. After executing the protocol implementing OPE, B should only learn P(x) and A should learn nothing.

Protocol for PRPE

  • (Step 1) Party 1 picks a random element z∈F and computes two new polynomials zP and zQ. In other words, party 1 “blinds” the polynomials P and Q.
  • (Step 2) Party 2 computes zP(α) and zQ(β) by invoking the protocol for OPE (denoted POPE) twice, i.e., it invokes POPE(zP, α) and POPE(zQ, β).
  • (Step 3) Party 2 computes P(α)/Q(β) by computing zP(α)/zQ(β) and sends it to party 1.
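
A toy numeric sketch of why the blinding step works; the OPE subprotocol is replaced here by direct evaluation purely to show the arithmetic, so this is not a secure implementation:

import random

# Party 1's polynomials P and Q, represented by coefficient lists (low degree first),
# over the integers for readability (the protocol actually works over a finite field).
P = [2, 0, 1]      # P(x) = 2 + x^2
Q = [1, 3]         # Q(x) = 1 + 3x

def evaluate(poly, point):
    return sum(c * point ** i for i, c in enumerate(poly))

# Party 2's evaluation points.
alpha, beta = 5, 4

# Step 1: party 1 blinds P and Q with a random z.
z = random.randint(2, 10**6)
zP = [z * c for c in P]
zQ = [z * c for c in Q]

# Step 2: party 2 learns zP(alpha) and zQ(beta) (via OPE in the real protocol).
a = evaluate(zP, alpha)
b = evaluate(zQ, beta)

# Step 3: the blinding factor z cancels in the ratio.
print(a / b, "==", evaluate(P, alpha) / evaluate(Q, beta))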

Protocol based on homomorphic encryption [14]

Let (G, E, D, M) be an encryption scheme, where G is the function generating the public parameters, E and D are the encryption and decryption functions, and M is the message space, with the following properties:

  • The encryption scheme (G, E, D) is semantically secure. Essentially, an encryption scheme is semantically secure if an adversary gains no extra information by inspecting the ciphertext.
  • For all m ∈ M and α ∈ M, c ∈ E(m) implies that c^α ∈ E(mα). Since encrypting the same message twice with a probabilistic encryption function can yield different ciphertexts, E(m) denotes the set of ciphertexts that can be obtained by encrypting m.
  • There is a computable function f such that for all messages m_1 and m_2 the following property holds:

$$f(E(m_1), E(m_2)) = E(m_1 + m_2)$$

The implementation in [14] uses the dense probabilistic encryption (DPE) scheme of Benaloh. The semantic security of Benaloh's scheme is based on the intractability of deciding prime residuosity.

Parties 1 and 2 have pairs of messages (x, n) and (y, m), respectively. The two parties want to jointly compute (x + y)/(n + m) in a privacy-preserving way. Assume that party 1 sets up a probabilistic encryption scheme (G, E, D, M) and publishes the public parameters G. We also assume that the scheme (G, E, D, M) satisfies the three properties given at the beginning of this section.

Protocol for WAP based on homomorphic encryption

  • (Step 1) Party 1 encrypts x and n and sends the encrypted values c_x ∈ E(x) and c_n ∈ E(n) to party 2.
  • (Step 2) Party 2 picks a random message z ∈ M and encrypts z·y and z·m to obtain c_{zy} ∈ E(z·y) and c_{zm} ∈ E(z·m). Party 2 computes the following two messages and sends them to party 1:

$$\gamma_1 = f(c_x^{\,z}, c_{zy}), \qquad \gamma_2 = f(c_n^{\,z}, c_{zm})$$

  • (Step 3) Using the two properties of the probabilistic encryption scheme (G, E, D), we have the following:

$$\gamma_1 = f(c_x^{\,z}, c_{zy}) \in E(zx + zy) = E(z(x+y)), \qquad \gamma_2 = f(c_n^{\,z}, c_{zm}) \in E(zn + zm) = E(z(n+m))$$

Therefore, party 1 can decrypt γ_1 and γ_2 to obtain z(x + y) and z(n + m), and hence can compute (x + y)/(n + m) = z(x + y)/z(n + m). Party 1 sends (x + y)/(n + m) to party 2.
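
A hedged sketch of this message flow using an additively homomorphic scheme. It substitutes Paillier encryption, via the third-party python-paillier package (phe), for Benaloh's scheme, since Paillier also provides the two homomorphic properties used above; this illustrates the protocol steps and is not the implementation of [14]:

import random
from phe import paillier   # third-party package: python-paillier

# Party 1's input (cluster sum x and count n) and key pair.
x, n = 42.5, 7
public_key, private_key = paillier.generate_paillier_keypair()

# Step 1: party 1 sends E(x) and E(n) to party 2.
c_x, c_n = public_key.encrypt(x), public_key.encrypt(n)

# Party 2's input and random blinding factor z.
y, m = 17.0, 3
z = random.randint(2, 2**32)

# Step 2: party 2 uses the homomorphic properties to build
# E(z*(x+y)) and E(z*(n+m)) without learning x or n.
gamma1 = c_x * z + public_key.encrypt(z * y)
gamma2 = c_n * z + public_key.encrypt(z * m)

# Step 3: party 1 decrypts, and the blinding factor z cancels in the ratio.
result = private_key.decrypt(gamma1) / private_key.decrypt(gamma2)
print(result, "==", (x + y) / (n + m))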

Glossary

Database

Confidentiality of Information

Cryptography

Laplace distribution

Gaussian Noise

Bibliography

Move to bibliography for "Privacy-preserving data mining" article.


Ashkerov V., 2015