Order-preserving encryption

From CryptoWiki
Jump to: navigation, search

Contents

Background

Nowadays required to constantly enhance the existing data warehouse and to improve qualification of the personnel, or use a popular approach is to provide database management to an external service provider.

This concept has been named "database as a service" (DBaaS)[B02]. Its essence that on request the user is provided with a database that uses cloud computing platforms. The user acquires the possibility of storage and dostupan information from a remote database without having to configure, administer, and attract additional resources required to ensure scalability of the database. This approach allows you to organize dynamic services that enables users to pay per use resources and adjust their volume depending on actual needs without long term commitments.

"Database as service" is one of the most important trends in the field of information resources management. The main reason for the development of cloud services is the ability to reduce costs of companies and individuals to maintain their own IT-infrastructure due to the transfer of this work to the cloud service provider.

One of the unsolved problems of cloud services is the security of stored data. Before the consumer will trust their data to cloud service, you need to ensure the preservation of their privacy and to eliminate or minimize the risk of unauthorized access to information, including from the owner of the cloud service. The use of cryptographic algorithms to encrypt sensitive data and then stored in a DBMS is not always the solution. In the case of relational databases it is impossible to make processing of encrypted data on the DBMS side. In the General case there are two variants. The first variant is to make processing and decoding on the client side, which will lead a substantial increase in traffic between the DBMS and the application, since, in General, is not acceptable to filter the required data. The second variant is to decrypt the data on the DBMS side that is unsafe, if the DBMS or the cloud service are not trusted. In order to solve this problem deals with the possibility of application of special types of encryption that allows not only to safely store data but also to perform some operations on encrypted data.


Вернуться к оглавлению

Introduction

Order-preserving symmetric encryption (OPE) is a deterministic encryption scheme whose encryption function preserves numerical ordering of the plaintexts.

OPE has a long history in the form of one-part codes, which are lists of plaintexts and the corresponding ciphertexts, both arranged in alphabetical or numerical order so only a single copy is required for efficient encryption and decryption. One-part codes were used, for example, during World War I. A more formal treatment of the concept of order-preserving symmetric encryption (OPE) was proposed in the database community by Agrawal[AKSY04]. The reason for new interest in such schemes is that they allow efficient range queries on encrypted data. That is, a remote untrusted database server is able to index the (sensitive) data it receives, in encrypted form, in a data structure that permits efficient range queries (asking the server to return ciphertexts in the database whose decryptions fall within a given range, say [a; b]). By "efficient" we mean in time logarithmic (or at least sub-linear) in the size of the database, as performing linear work on each query is prohibitively slow in practice for large databases.

Order-preserving encryption scheme (OPES) allows comparison operations to be directly applied on encrypted data,without decrypting the operands. Thus, equality and range queries as well as the MAX, MIN, and COUNT queries can be directly processed over encrypted data. Similarly, GROUP BY and ORDER BY operations can also be applied. Only when applying SUM or AVG to a group do the values need to be decrypted.

OPES is also endowed with the following properties:

  • The results of query processing over data encrypted using OPES are exact. They neither contain any false positives nor miss any answer tuple.
  • OPES handles updates gracefully. A value in a column can be modified or a new value can be inserted in a column without requiring changes in the encryption of other values.
  • OPES can easily be integrated with existing database systems as it has been designed to work with the existing indexing structures such as B-trees. The fact that the database is encrypted can be made transparent to the applications.

Mathematical definitions

Definition 1. Encryption is an injective mapping (F) from the set of plaintexts (P) into the set of ciphertext (C): F:P →C.

Definition 2. OPES is the deterministic encryption scheme with a symmetric key which preserves the order of plaintexts.

Definition 3. Let m≤n, P={i| 1≤i≤m} - is the set of plaintexts, С= {i| 1≤i≤n} - is the set of ciphertexts. SEm,n= (Km,n, Em,n, Dm,n) is the deterministic symmetric encryption scheme, where Km,n : {0,1}*→ {0,1}* is the key generation function, Em,n:P⨯{0,1}*→C is the deterministic symmetric encryption algorithm, Dm,n:С⨯{0,1}*→Р is the decryption algorithm, such that ∀ x ∈ P and any valid key k, x < x` ⇔ Em,n(x,k)< Em,n(x`,k).

Definition 4. The numeral system is called the system of methods and rules allowing to establish a one-to-one correspondence between any number and its representation as a set of finite number of symbols.

B-trees

As you know[BCO11] that any order relation on a set can be specified in a binary tree. On the other hand in the theory of data compression the construction of the Huffman tree is a direct manifestation of monotonicity with respect to the order of the symbols relative frequency of occurrence of a given character in the text. More frequently occurring symbol is less than rare. This approach is implemented in the work [PLZ11] . The model proposed by the authors, is a client-server architecture. The client has some data that he needs to be encrypted and stored on the untrusted server side. Encrypted data is stored in a binary search tree. To binary search tree for every node v, all nodes in the left subtree is strictly less than v, and all nodes in the right subtree is strictly greater than v. Additionally, the implementation uses B-trees, since they have advantages such as logarithmic time insert, delete, search, can effectively carry out updates chiprotect. At each node of the tree is stored encrypted with the client key, the plaintext, according to the corresponding order relation. To encrypt the values are deterministic algorithms (DET), responsible with the cryptographic strength the cryptographic security of a random function, for example AES. A tree with such properties (OPE Tree), shown in Figure 1.

Figure 1 - OPE tree and OPE table for this tree

This method has several inevitable drawbacks associated with the use of binary B-tree. The complexity of binary search tree in the worst case O(n), average O(logn), where n is the number of tops in the tree. However, the plaintext cannot be encrypted, before it found its position in the tree, therefore, at each step required to carry out the decryption of the corresponding tree nodes. The complexity of insert and delete is similar, and also at each step requires interaction between the client and the server and performing operations of decryption. Thus, the client to encrypt the n records will be on average O(logn). In addition, to prevent the growth of siprotect and maintain approximately the same length left and right subtrees in each node, in the process of working with wood is required to conduct a balancing operation. Although the authors suggest the technique of balance, in the best case requires O(logn), with a large number of records on the server this operation will take time.

Also as a result of balancing some chipertext can become invalid, that is, chipertext can no longer match their positions in the tree. For example, after the client receives the ciphertext for the given value, then the server is balancing operation. In order to avoid such a situation, the server maintains OPE table, displaying chiprotect in OPE codes. OPE codes in the table is updated every time when you insert into the tree or the process of balancing. Use OPE table, the server can determine the position of the ciphertext in the tree without reference to the client, also OPE table can be used as a cache while encrypting the duplicate values.

In General, the encryption algorithm is as follows:

  1. The client calculate DET ciphertext c and send it to the server.
  2. If ciphertext c is already in OPE table, then the tree do not change.
  3. Else:
    1. Bypass the OPE tree when the client communicates with the server, which is determined by the insertion of ciphertext c and its OPE code. OPE code is stored in the table.
    2. If the balancing of a tree operation is required , then afterwards, update the OPE codes in the table.
    3. In the result of theoperations on the server OPE tree and OPE table are modified .

Для определения порядка открытых текстов соответствующих шифротекстов используются OPE коды из OPE таблицы. To determine the order of plaintexts corresponding ciphertext used OPEcodes from the OPE table.

Encryption schemes

Summation of Random Numbers

It is a simple scheme described in [B02]that computes the encrypted value с of integer p as OPE 9.png , where OPE 2.png is the j-th value generated by a secure pseudo-random number generator R. Unfortunately, the cost of making р, calls to R for encrypting or decrypting с can be prohibitive for large values of p .

Secret key is a random number generator.

The scheme is OPES, as larger х corresponds to larger value of the ciphertext. The scheme has disadvantages. First, the higher the value, the longer the sequence must be generated for encryption and for decryption. Secondly, the distribution of plaintexts can be detected based on the distribution of siprotect.

A more serious problem is the vulnerability to estimation exposure. Since the expected gap between two encrypted values is proportional to the gap between the corresponding plaintext values, the nature of the plaintext distribution can be inferred from the encrypted values. Figure 2 shows the distributions of encrypted values obtained using this scheme for data values sampled from two different distributions: Uniform and Gaussian. In each case, once both the input and encrypted distributions are scaled to be between 0 and 1, the number of points in each bucket is almost identical for the plaintext and encrypted distributions. Thus the percentile of a point in the encrypted distribution is also identical to its percentile in the plaintext distribution.

Figure 2 - Summation of random numbers

Polynomial Functions

As was written in [GS03], sequence of strictly increasing polynomial functions is used for encrypting integer values while preserving their order. These polynomial functions can simply be of the first or second order, with coefficients generated from the encryption key.

Secret key is the coefficients of the polynomial functions, which can be even first or second order only. Distribution of siprotect and open texts differ, but, as stated in the article, with some analysis of the distribution of siprotect, it is also possible to draw conclusions about the distribution of plaintexts.

An integer value is encrypted by applying the functions in such a way that the output of a function becomes the input of the next function. Correspondingly, an encrypted value is decrypted by solving these functions in reverse order. However, this encryptionmethod does not take the input distribution into account. Therefore the shape of the distribution of encrypted values depends on the shape of the input distribution, as shown in Figure 3 for the encryption function, described in [GS03. This illustration suggests that this scheme may reveal information about the input distribution, which can be exploited.

Figure 3 - Polynomial Functions

Agrawal encryption scheme

In [PLZ11] was describes an approach. The interval [0; T] –is a set of plainttext (natural numbers). The interval is randomly divided into many segments, each segment is mapped to a monotonically increasing linear function. Functions are selected so as to form a strictly monotonically increasing piecewise linear function on the interval [0; T], to preserve order.

The secret key consists of the described function and their compliance with the intervals.

Encrypted number x belongs to one of the segments of the interval [0; T]. The number is input to the corresponding segment linear function. The result of the function is the result of encryption.

The union of the segments formed by areas of a value of the key determines the set of chiphertext. An attacker with access to chiphertext can identify the boundaries of the regions of a value of a key, to determine which of the segments of the interval got his encrypted plaintext. Since the function is linear, then the attacker can guess the encryption key.

The coefficients, suggested in [HILM02] are vulnerable to many attacks. Instead, you can build a B-tree over text values, but then you need to encrypt every block and B-tree at each level using symmetric encryption. The advantage of this approach is that the content of B-tree is not visible even if you use an untrusted database server. The disadvantage is that B-trees can now be viewed only by executing a sequence of queries that retrieve tree nodes at successively deeper level.

Security

The security of an encryption scheme is conventionally assessed by analyzing whether an adversary can find the key used for encryption. The articles [S96], [S02] relate to the classification of different levels of attacks on cryptosystems.

When dealing with sensitive numeric data, an adversary does not have to determine the exact data value p corresponding to an encrypted value с;a breach may occur if the adversary succeeds in obtaining a tight estimate of p. For a numeric domain P, if an adversary can estimate with с% confidence that a data value p lies within the interval[р1; р2] then the interval width (р2 - р1)/(domain-width (P) ) defines the amount of estimation exposure at C% confidence level.

Clearly, any order-preserving encryption scheme is vulnerable to tight estimation exposure if the adversary can choose any number of unencrypted (encrypted) values of his liking and encrypt (decrypt) them into their corresponding encrypted (unencrypted) values. Similarly, any order-preserving encryption is not secure against tight estimation exposure if the adversary can guess the domain and knows the distribution of values in that domain.

Modular Order-Preserving Encryption

A modular order-preserving encryption(MOPE) scheme is an extension to OPE that increases its security. Instead of defining such a scheme in general, we define a transformation to obtain it from a given OPE scheme.

The transformation. Let OPE = (Kg`; Enc`; Dec`) be an OPE scheme. We define the associated modular OPE: MOPE[OPE] = (Kg; Enc; Dec) , where

Kg  generates  K←s Kg` и j←s [M];  it outputs  (K, j) .
Enc , on inputs a key K and a plaintext m outputs Enc`(K, m+j mod M) .
Dec , on inputs a key K and a ciphertext coutputs Dex`(K, c) - j mod M .

Above, the value j in the secret key of MOPE[OPE] is called the secret offset or displacement .


The problem

With MOPE, as is also the case for OPE, it is very easy for the client to execute range queries: to make a range query [ml;mr], for which 1 ≤ ml ≤ mr ≤ М, the client computes the pair (cl; cr) = (Enc(K;ml), Enc(K;mr))and sends it to the server. The server then returns all database ciphertexts found in the range [cl; cr]. Note that in an MOPE scheme, this ciphertext interval could "wrap around" the space, i.e., if cr < cl, then the server returns database ciphertexts found in [cl;N] and [1; cr]. For simplicity, we will use the notation [cl; cr] independent of whether or not the interval wraps around.


However, how was admitted in [S96], there is a security vulnerability introduced by making range queries with MOPE. Note that all valid range queries, i.e., those [ml;mr], for which 1 ≤ ml ≤ mr ≤ М, when encrypted may "cover" every value in the ciphertext-space [N] except from those ciphertexts lying between Enc(K;M) and Enc(K;1). Therefore, after observing enough queries, the adversary can get a better idea of where this gap lies, increasing its probability to predict j. In Figure. 3 we show the histogram of the start values of a set of range queries for a small domain and displacement j= 20. In particular, we assume that the domain (and the range) is [0; 100] and that all possible valid range queries with length 10 (k = 10) are generated and executed. In that case, the adversary will observe that there are no queries that start between the values 10 and 20, which correspond to the end of the domain before the displacement. Therefore, it can easily infer that the displacement is 20.

One natural step toward avoiding this attack is to introduce wrap-around queries [GS03]. A wrap-around range query corresponds to a pair (ml, mr), for which mr < ml. The desired interval of values wraps around the space, and is [mr;M] and [1; ml]. MOPE schemes naturally support wraparound range queries in the same manner as standard range queries. These queries are not practically useful, but can be used as "dummy queries" in fooling the gap attack.

Figure 3 - The histogram of the start values

Uniform Query Algorithm

The idea behind our basic query algorithm is as follows. We suppose that the user's queries come from some distribution Q. Each time we need to make a query, we ip an α-biased coin, which returns heads with probability α, to decide whether to make a query from Q or from some other distribution Q`. We will choose α and Q`, such that the convex combination Q + (1- α) Q `= U, where U is the uniform distribution on the entire query space considered, including wrap-around queries in the case of MOPE. This solves two problems: (1) Q may not be uniform, even on standard (non-wrap-around) queries, invalidating the techniques for the security, developed in [BCO11] и (2) Q depends on the secret offset, whileU clearly does not, and consequently hides all information about it.

DefinitionLet D be a probability distribution on a finite set S. Define µd=max D(i), i∈S, and define the completion of D, denoted D`, as : OPE 6.png, for all i∈S.

Representation of the query distribution

To represent the query distribution we use a histogram on the query domain. However, since each query can have different start location and length, to represent all possible queries, we need to keep О(M2) values. In order to address this problem, we pick a fixed query length k≥1 and we decompose every query into a set of queries with length k. So, if an original user query has length smaller than k, we use a single fixed query that starts at the same location as the original query. On the other hand, if the original user query is larger than k, we split the query into a set of range queries of size k, where again the first fixed query starts from the start location of the original query. This approach guarantees that we need only О(M) values to represent the query histogram. Notice that if a user's query q is decomposed into multiple fixed length queries, the results of all the decomposed queries should be returned to the user.

The query algorithm

Describe the algorithm, which we call QueryU, which processes a new query q. The algorithm is initialized with the completion distributionQ`, the maximum query probability µq and a fixed length query k. We denote by Bern( α) the Bernoulli distribution with parameter α, i. e., Bern(α) = 1 with probability α. Note that the input to the algorithm is just the start location of the query, since the query has a fixed length. The algorithm is shown on Figure 4.

Figure 4 - Uniforn query algorithm

The algorithm described above should be repeated until, until all the user requests will not be fulfilled. In the last line, is randomly chosen item from the buffer, but the buffer itself remains unchanged.

Glossary

Bibliography

Move to bibliography for "Order-preserving encryption" article.

Move to article "Order-preserving encryption" in Russian.

Move to Main Page

Isupova О., 2015