Differential privacy

What is differential privacy

Differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population.

Consider a trusted party that holds a dataset of sensitive private information (for example, email usage, movie viewing, or medical records) and would like to provide global, statistical information about the data. Such a system is called a statistical database. Providing aggregate statistical information about the data may reveal some information about the individuals.

Differential privacy guarantees that data about individuals cannot be obtained from such a database, no matter what other datasets or information sources are available to the attacker.

This is achieved because the database owner uses a release mechanism (algorithm) under which the presence or absence of information about any individual does not significantly affect the output of a query.

Purposes of differential privacy

Differential privacy aims to maximize the accuracy of queries to statistical databases while minimizing the chances of identifying their records.

The problem of privacy-preserving data analysis has a long history spanning multiple disciplines. As electronic data about individuals becomes increasingly detailed, and as technology enables ever more powerful collection and curation of these data, the need increases for a robust and mathematically rigorous definition of privacy, together with a computationally rich class of algorithms that satisfy this definition.

Various approaches to anonymizing public records have failed when researchers managed to identify personal information by linking two or more separately innocuous databases. Differential privacy is a framework for formalizing privacy in statistical databases introduced in order to protect against these kinds of deanonymization techniques.

Deanonymization of some statistical databases

Netflix database

In 2007, Netflix offered a $1 million prize for a 10% improvement in its recommendation system. Netflix also released a training dataset for the competing developers to train their systems. While releasing this dataset, Netflix stated that all personal information identifying individual customers had been removed and that all customer identifiers had been replaced by randomly assigned identifiers.

There are many other movie-rating portals on the web, including IMDb, where individuals can register and rate movies and have the option of not keeping their details anonymous. Researchers from the University of Texas linked the anonymized Netflix training dataset with the IMDb database and were able to partially deanonymize it. [1]

Medical database in Massachusetts

A researcher from Carnegie Mellon University linked an anonymized medical database, which retained each patient's birth date, sex, and ZIP code, with voter registration records and was able to identify the medical record of the governor of Massachusetts. [2]

Mobility databases

Researchers from MIT introduced the notion of unicity and showed that four spatio-temporal points (approximate places and times) are enough to uniquely identify 95% of 1.5 million people in a mobility database. The study further showed that these constraints hold even when the resolution of the dataset is low, meaning that even coarse or blurred mobility datasets and metadata provide little anonymity. [3]

Definition

Differential privacy

Let ε be a positive real number and $A$ be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let $\mathrm{im}\,A$ denote the image of $A$. The algorithm $A$ is ε-differentially private if, for all datasets $D_1$ and $D_2$ that differ on a single element (i.e., the data of one person), and all subsets $S$ of $\mathrm{im}\,A$:

$$P[A(D_1) \in S] \le e^{\varepsilon} \cdot P[A(D_2) \in S],$$

where $P$ denotes probability [4].

According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information about the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.
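
To make the definition concrete, here is a minimal Python sketch (added for illustration, not part of the original article) of randomized response, a classic mechanism that satisfies the definition: each respondent reports their true bit but flips it with probability p, and the worst-case ratio of output probabilities gives ε = ln((1 − p)/p).

```python
import math
import random

def randomized_response(true_bit, p_flip=0.25):
    """Report the respondent's true bit, but flip it with probability p_flip."""
    return true_bit if random.random() >= p_flip else 1 - true_bit

# For neighbouring inputs that differ only in this respondent's bit, the
# output probabilities differ by at most a factor of (1 - p_flip) / p_flip,
# so the mechanism is ln((1 - p_flip) / p_flip)-differentially private.
p_flip = 0.25
epsilon = math.log((1 - p_flip) / p_flip)
print(f"randomized response with p_flip = {p_flip} gives epsilon = {epsilon:.3f}")  # ln 3
```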

Sensitivity

Let $d$ be a positive integer, $\mathcal{D}$ be a collection of datasets, and let $f : \mathcal{D} \to \mathbb{R}^d$ be a function. The sensitivity of a function $f$, denoted $\Delta f$, is defined by

$$\Delta f = \max \lVert f(D_1) - f(D_2) \rVert_1,$$

where the maximum is over all pairs of datasets $D_1$ and $D_2$ in $\mathcal{D}$ differing in at most one element, and $\lVert x \rVert_1 = \sum_{i=1}^{d} |x_i|$ is the $\ell_1$ norm. [5]

Example

For example, suppose we have a database of medical records $D_1$ in which each record is a pair (Last name, $X$), where $X \in \{0, 1\}$ (1 if the person has the allergy in question, 0 otherwise). $D_1$ is shown in Table 1.

Table 1 — Database of medical records


Suppose a malicious user wants to find out whether Taylor has the allergy or not, and he also knows in which row of the database Taylor resides. Now suppose the adversary is only allowed to use a particular form of query $Q_i$ that returns the partial sum of the first $i$ rows of column $X$. In order to find Taylor's value of $X$, the adversary executes $Q_5$ and $Q_4$ and computes their difference: $Q_5 - Q_4 = 3 - 2 = 1$. This indicates that the $X$ field in Taylor's row must be 1.

If we construct $D_2$ by replacing (Taylor, 1) with (Taylor, 0), then this malicious adversary will be able to distinguish $D_2$ from $D_1$ by computing $Q_5 - Q_4$ for each dataset. If the adversary were required to receive the values of $Q_i$ via an ε-differentially private algorithm, then he would be unable to distinguish between $D_1$ and $D_2$.

The sensitivity of the query function $Q_i$ is 1, since changing any one entry in the database causes its output to change by either zero or one.
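
The attack can be written out in a few lines of Python. This is a sketch added for illustration: the names and values other than Taylor's row are invented, since the contents of Table 1 are not reproduced here.

```python
# Hypothetical stand-in for Table 1: (last name, X) pairs.  Only Taylor's
# row follows the example above; the other entries are made up.
d1 = [("Smith", 1), ("Jones", 0), ("Brown", 1), ("Green", 0), ("Taylor", 1)]

def q(db, i):
    """Partial-sum query Q_i: sum of column X over the first i rows."""
    return sum(x for _, x in db[:i])

# Differencing attack: Q5 - Q4 isolates the fifth row, which is Taylor's.
print(q(d1, 5) - q(d1, 4))   # 3 - 2 = 1, so Taylor's X field is 1
```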

Mechanisms for achieving differential privacy

In many cases, differential privacy can be achieved by adding random noise to the result. For example, noise drawn from a Laplace or Gaussian distribution can be injected, producing a result that is not exact but that masks the contribution of any single row.
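
A minimal sketch of the Laplace mechanism, assuming a simple counting query; the function name, parameters, and data are illustrative, not a specific library API.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return the exact answer perturbed with Laplace(sensitivity / epsilon) noise."""
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query ("how many rows have X = 1") has sensitivity 1, because
# adding or removing one person's record changes the count by at most 1.
x_column = [1, 0, 1, 0, 1]                       # hypothetical data
noisy_count = laplace_mechanism(sum(x_column), sensitivity=1, epsilon=0.5)
print(noisy_count)                               # 3 plus Laplace noise
```

Note that the noise scale $\Delta f / \varepsilon$ grows as ε shrinks, which is exactly the accuracy cost discussed in the next section.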

There are other ways to ensure differential privacy of data. [6, 7]


A tradeoff between privacy and accuracy

The amount of information leakage from a single query can be bounded by a small value, but this value is not zero. Each query of the database increases the total leakage, and over many queries the leakage adds up.

The total allowed leakage is often referred to as a privacy budget, and it determines how many queries will be allowed and how accurate the results will be. Set the budget too high and sensitive data is exposed; set it too low and the answers may not be particularly useful.

  • Once you’ve leaked as much data as your calculations tell you is safe, you should stop database providing, because your users’ privacy can be violated.
  • The more information you intend to “ask” of your database, the more noise has to be injected in order to minimize the data leakage.
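
As an illustration only (not taken from the source), the following Python sketch tracks a cumulative privacy budget under sequential composition; the PrivacyAccountant class, its method names, and the numeric values are assumptions made for this example.

```python
import numpy as np

class PrivacyAccountant:
    """Track the total epsilon spent across queries (sequential composition)."""

    def __init__(self, total_budget):
        self.total_budget = total_budget
        self.spent = 0.0

    def answer(self, exact_value, sensitivity, epsilon):
        # Refuse the query if answering it would exceed the privacy budget.
        if self.spent + epsilon > self.total_budget:
            raise RuntimeError("Privacy budget exhausted: refuse further queries.")
        self.spent += epsilon
        return exact_value + np.random.laplace(scale=sensitivity / epsilon)

# With a budget of 1.0 and epsilon = 0.4 per query, only two queries are allowed.
acct = PrivacyAccountant(total_budget=1.0)
print(acct.answer(42, sensitivity=1, epsilon=0.4))
print(acct.answer(42, sensitivity=1, epsilon=0.4))
# A third such query would raise RuntimeError.
```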

The tradeoff between accuracy and privacy can be a serious problem when training complex machine learning models. The authors of [8] used machine learning techniques to develop a medicine dosing model. They set various privacy budgets while training the model and then evaluated both the information leakage and the model's success at treating simulated patients.

The results showed that the model's accuracy depends strongly on the privacy budget with which it was trained. If the budget is set too high, the database leaks a great deal of sensitive patient information, but the resulting model makes safe dosing decisions. When the budget was reduced to a level that achieved meaningful privacy, the model had a tendency to prescribe lethal doses.

Application of differential privacy idea

Some examples of using differential privacy in practice:

  • Google, for sharing traffic statistics.
  • Apple, in iOS 10 to improve intelligent assistance and suggestions technology.
  • U.S. Census Bureau, for publishing statistics.

Vuvuzela

Vuvuzela (secure messaging system) uses ideas from differential privacy to prevent an adversary from learning which pairs of users are communicating. Vuvuzela ensures that any observation that an adversary can perform will be almost independent of whether some user is active or not.

Alpenhorn

Alpenhorn (secure private messaging system) uses differential privacy ideas to prevent an adversary from determining pairs of communicating users or users that are added to the contact list.

Glossary

Bibliography

Go to the list of references.

Sokolov A.

2016