Achieving Privacy-preserving Distributed
Statistical Computation
A thesis submitted to the University of Manchester
for the degree of Doctor of Philosophy
in the Faculty of Engineering and Physical Sciences
2012
By
Meng-Chang Liu
The School of Computer Science
Contents

Contents
List of Tables
List of Figures
Abbreviations
Abstract
Declaration
Copyright and the Ownership of Intellectual Property Rights
Dedication
Acknowledgements
Chapter 1 Introduction
    1.1 Distributed Statistical Computation
    1.2 Privacy Concerns in Distributed Statistical Computation
    1.3 Research Motivation and Challenges
    1.4 Research Aim and Objectives
    1.5 Research Method
    1.6 Novel Contributions
    1.7 Thesis Structure
Chapter 2 Literature Survey: Privacy-preserving Distributed Data Computation
    2.1 Chapter Introduction
    2.2 Terminologies and Definitions
        2.2.1 Computation Models
        2.2.2 Data Partitioning Models
        2.2.3 Adversarial Behaviours
        2.2.4 Data Privacy Definitions Used in Related Works
    2.3 Privacy-preserving Distributed Data Computation: State-of-the-Art
        2.3.1 Secure Multi-party Computation (SMC)
        2.3.2 Privacy-preserving Data Mining (PPDM)
        2.3.3 Privacy-preserving Distributed Statistical Computation (PPDSC)
    2.4 Identification of the Research Gap
    2.5 The Best Way Forward
    2.6 Chapter Summary
Chapter 3 Design Preliminaries and Evaluation Method
    3.1 Chapter Introduction
    3.2 Definition of Data Privacy
    3.3 The NST Computation
        3.3.1 The NST Computation Problem
        3.3.2 The TTP-NST Algorithm
    3.4 Design Requirements
    3.5 Evaluation Strategy
        3.5.1 Correctness
        3.5.2 Level of Security
        3.5.3 Computational Overhead
        3.5.4 Communication Overhead
        3.5.5 Execution Time
    3.6 Simulation Method
        3.6.1 Assumptions
    3.7 Chapter Summary
Chapter 4 Privacy-preserving Building Blocks
    4.1 Chapter Introduction
    4.2 Data Perturbation Techniques
        4.2.1 Data Swapping
        4.2.2 Data Randomization
        4.2.3 Data Transformation
    4.3 Cryptographic Primitives
        4.3.1 Additively Homomorphic Cryptosystem
    4.4 A Comparison of Privacy-preserving Building Blocks
    4.5 Chapter Summary
Chapter 5 A Novel Privacy-preserving Two-party Nonparametric Sign Test Protocol Suite Using Data Perturbation Techniques (P22NSTP)
    5.1 Chapter Introduction
    5.2 Overview of the P22NSTP Protocol Suite
    5.3 The Design in Detail
        5.3.1 Computation Participants and Message Objects
        5.3.2 Components of the P22NSTP Protocol Suite
    5.4 The P22NSTP Protocol Suite and Its Operation
        5.4.1 Operation of the P22NSTP Protocol Suite
        5.4.2 Correctness
        5.4.3 Protocol Analysis
    5.5 Chapter Summary
Chapter 6 A Novel Privacy-preserving Two-party Nonparametric Sign Test Protocol Suite Using Cryptographic Primitives (P22NSTC)
    6.1 Chapter Introduction
    6.2 Overview of the P22NSTC Protocol Suite
    6.3 The Design in Detail
        6.3.1 Computation Participants and Message Objects
        6.3.2 Components of the P22NSTC Protocol Suite
    6.4 The P22NSTC Protocol Suite and Its Operation
        6.4.1 The Operation
        6.4.2 Correctness
        6.4.3 Protocol Analysis
    6.5 Chapter Summary
Chapter 7 A Comparison of the TTP-NST, the P22NSTP and the P22NSTC
    7.1 Chapter Introduction
    7.2 A Comparison of Privacy Protection
    7.3 A Comparison of Computational Overhead
    7.4 A Comparison of Communication Overhead
    7.5 A Comparison of Execution Time
    7.6 Further Discussions
    7.7 Chapter Summary
Chapter 8 Conclusion and Future Work
    8.1 Thesis Summary
        8.1.1 Review of the Thesis
        8.1.2 Contributions
    8.2 Future Work
References
Appendix
    A. Definitions of Privacy
    B. Protocol Prototypes
List of Tables

Table 1. A comparison of solutions to the YMP.
Table 2. An example table of frequency counts for data subjects whose age is between 1 and 10 and who live in areas A1 to A4.
Table 3. An example of attribute recoding.
Table 4. An example of top-recoding.
Table 5. A table with sensitive cell (C,B).
Table 6. A table with sensitive cells suppressed and further complementary suppression made.
Table 7. A table with cell values randomly rounded to a base of 5.
Table 8. The entropy values.
Table 9. Sample size versus the increment of entropy when the level of noise data item addition is increased.
Table 10. The entropy value and the increment versus k7 (k7 = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100).
Table 11. The entropy value and the increment versus k7 (k7 = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000).
Table 12. The entropy value and the increment versus k7 (k7 = 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000).
Table 13. The entropy value and the increment versus sample size (n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100).
Table 14. The entropy value and the increment versus sample size (n = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000).
Table 15. The entropy value and the increment versus sample size (n = 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000).
Table 16. The entropy value and the increment versus sample size (n = 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000).
Table 17. The execution time of the TTP-NST, the P22NSTP and the P22NSTC (sec).
Table 18. Protocol efficiency for the TTP-NST, the P22NSTP and the P22NSTC.
List of Figures

Figure 1. The trusted third party (TTP) computation model.
Figure 2. An example of the commodity server model.
Figure 3. An example of the program issuer model.
Figure 4. An example of the fairness checker STTP model.
Figure 5. An example of the on-line STTP model.
Figure 6. An example of the two-party computation model.
Figure 7. An example of an n-party computation.
Figure 8. An example of a vertically partitioned data model in the two-party computation.
Figure 9. An example of a horizontally partitioned data model in the two-party computation.
Figure 10. An example of the V2PO data model.
Figure 11. A taxonomy of the developed PPDM algorithms.
Figure 12. The TTP-NST computation.
Figure 13. The TTP-NST algorithm.
Figure 14. An example of a data swapping operation.
Figure 15. An example of noise value addition randomisation.
Figure 16. An example of noise addition randomisation.
Figure 17. The homomorphic cryptosystem.
Figure 18. A comparison of privacy-preserving building blocks.
Figure 19. An overview of the P22NSTP computation.
Figure 20. The RpdfGP algorithm.
Figure 21. The DOP algorithm.
Figure 22. The STCP algorithm.
Figure 23. The DEP algorithm.
Figure 24. The detailed relationship among Ci, Idci and TRi.
Figure 25. The PRP algorithm.
Figure 26. The P22NSTP protocol suite operation.
Figure 27. The entropy value versus the number of noise data items (n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100).
Figure 28. The entropy value versus the number of noise data items (n = 10, 100, 1000).
Figure 29. The entropy value versus the number of noise data items (n = 1000, 10000, 100000).
Figure 30. The entropy value versus the value of k7 (k7 = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100).
Figure 31. The entropy value versus the value of k7 (k7 = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000).
Figure 32. The entropy value versus the value of k7 (k7 = 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000).
Figure 33. The entropy value versus the value of k7.
Figure 34. Number of computational operations vs. number of noise data items added by Alice and Bob (sample size n = 10, 20, 30).
Figure 35. Total communication overhead vs. number of noise data items added by Alice and Bob (sample size n = 10, 20, 30).
Figure 36. Protocol suite execution time vs. number of noise data items added by Alice and Bob (sample size n = 10, 20, 30).
Figure 37. An overview of the P22NSTC computation.
Figure 38. The DSP algorithm.
Figure 39. The DRP algorithm.
Figure 40. The P22NSTC algorithm.
Figure 41. The entropy value versus sample size (n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100).
Figure 42. The entropy value versus sample size (n = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000).
Figure 43. The entropy value versus sample size (n = 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000).
Figure 44. The entropy value versus sample size (n = 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000).
Figure 45. The entropy value versus sample size (overview).
Figure 46. Computational overhead for non-cryptographic and cryptographic operations vs. number of noise data items added by the STTP for Alice and Bob, respectively (sample size n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700).
Figure 47. Computational overhead for non-cryptographic and cryptographic operations vs. number of noise data items added by the STTP (sample size n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700).
Figure 48. Communication overhead for non-encrypted and encrypted data items vs. number of noise data items added by the STTP for Alice and Bob, respectively (sample size n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700).
Figure 49. Communication overhead for non-encrypted and encrypted data items vs. number of noise data items added by the STTP (sample size n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700).
Figure 50. Protocol suite execution time vs. number of noise data items added by the STTP for Alice and Bob (sample size n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700).
Figure 51. Protocol suite execution time vs. number of noise data items added by the STTP (sample size n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700).
Figure 52. A comparison of privacy protection by the TTP-NST, the P22NSTP and the P22NSTC.
Figure 53. A comparison of computational overhead.
Figure 54. A comparison of communication overhead.
Figure 55. A comparison of execution time for the TTP-NST, the P22NSTP and the P22NSTC (sec).
Figure 56. A comparison of protocol efficiency for the TTP-NST, the P22NSTP and the P22NSTC.
Abbreviations
DEP Data Extraction Protocol
DOP Data Obscuring Protocol
DRP Data Randomization Protocol
DSP Data Separation Protocol
NST Nonparametric Sign Test
P22NSTP Privacy-preserving Two-party Nonparametric Sign Test Protocol Suite using Data Perturbation Techniques
P22NSTC Privacy-preserving Two-party Nonparametric Sign Test Protocol Suite using Cryptographic Primitives
PIR Private Information Retrieval
PPDDC Privacy-preserving Distributed Data Computation
PPDM Privacy-preserving Data Mining
PRP Permutation Reverse Protocol
RpdfGP Random Probability Density Function Generation Protocol
SDC Statistical Disclosure Control
SLRC Secure Linear Regression Computation
SMC Secure Multi-party Computation
SMPC Secure Matrix Product Computation
SPFE Selective Private Function Evaluation
SSC Secure Statistical Computation
STCP Secure Two-party Comparison Protocol
STTP Semi-trusted Third Party
TP Third Party
TTP Trusted Third Party
TTP-NST TTP-based Nonparametric Sign Test Computation
YMP Yao’s Millionaires’ Problem
Abstract
Name of the University: University of Manchester
The Candidate’s full name: Meng-chang Liu
Degree Title: Doctor of Philosophy
Thesis Title: Achieving Privacy-preserving Distributed Statistical Computation
Date: 30 September 2011
The growth of the Internet has opened up tremendous opportunities for cooperative
computations where the results depend on the private data inputs of distributed
participating parties. In most cases, such computations are performed by multiple
mutually untrusting parties. This has led the research community into studying
methods for performing computation across the Internet securely and efficiently.
This thesis investigates security methods in the search for an optimum solution to
privacy-preserving distributed statistical computation problems. For this purpose, the
nonparametric sign test algorithm is chosen as a case for study to demonstrate our
research methodology. Two privacy-preserving protocol suites using data perturbation
techniques and cryptographic primitives are designed. The first protocol suite, i.e. the
P22NSTP, is based on five novel data perturbation building blocks, i.e. the random
probability density function generation protocol (RpdfGP), the data obscuring
protocol (DOP), the secure two-party comparison protocol (STCP), the data extraction
protocol (DEP) and the permutation reverse protocol (PRP). This protocol suite
enables two parties to efficiently and securely perform the sign test computation
without the use of a third party. The second protocol suite, i.e. the P22NSTC, uses an
additively homomorphic encryption scheme and two novel building blocks, i.e. the
data separation protocol (DSP) and data randomization protocol (DRP). With some
assistance from an on-line STTP, this protocol suite provides an alternative solution
for two parties to achieve a secure privacy-preserving nonparametric sign test
computation. These two protocol suites have been implemented using MATLAB
software. Their implementations are evaluated and compared against the sign test
computation algorithm on an ideal trusted third party model (TTP-NST) in terms of
security, computation and communication overheads and protocol execution times.
By managing the level of noise data item addition, the P22NSTP can achieve specific
levels of privacy protection to fit particular computation scenarios. Alternatively, the
P22NSTC provides a more secure solution than the P22NSTP by employing an on-line
STTP. The level of privacy protection relies on the use of an additively homomorphic
encryption scheme, DSP and DRP. A four-phase privacy-preserving transformation
methodology has also been demonstrated; it includes data privacy definition,
statistical algorithm decomposition, solution design and solution implementation.
Declaration
No portion of the work referred to in the thesis has been submitted in support of an
application for another degree or qualification of this or any other university or other
institute of learning.
Signed………………………………………………………………………………….
Copyright and the Ownership of Intellectual Property
Rights
i. Copyright in text of this thesis rests with the Author. Copies (by any process)
either in full, or of extracts, may be made only in accordance with instructions
given by the Author and lodged in the John Rylands University Library of
Manchester. Details may be obtained from the Librarian. This page must form
part of any such copies made. Further copies (by any process) of copies made in
accordance with such instructions may not be made without the permission (in
writing) of the Author.
ii. The ownership of any intellectual property rights which may be described in this
thesis is vested in the University of Manchester, subject to any prior agreement to
the contrary, and may not be made available for use by third parties without the
written permission of the University, which will prescribe the terms and
conditions of any such agreement.
iii. Further information on the conditions under which disclosures and exploitation
may take place is available from the Head of the School of Computer Science.
Acknowledgements
I would like to express my sincerest gratitude to my supervisor, Dr. Ning Zhang, for
her guidance and valuable advice throughout this PhD research.
I would also like to thank my friends and colleagues, Namshik and Kits, for their
valuable discussions about the simulations.
Chapter 1 Introduction
1.1 Distributed Statistical Computation
With the advancement of computer and internet technologies, organizations and
individuals are now able to collect and store a great amount of data for research or
business purposes. Consequently, the more records are collected, the more valuable
information the data contain. Applications of statistical data analysis techniques have
hugely improved the utility of these data for such purposes. Using appropriate
techniques, the trends or statistical properties of a dataset can easily be computed from
the raw data. Data holders are then able to extract specialised information from the
computation results. For example, by analysing the 1900-1950 temperature records in
the Arctic with a statistical hypothesis test technique, the hypothesis that the average
Arctic temperature is rising can be tested, and factors affecting global warming in the
Arctic can be identified. Researchers may then be able to investigate
more efficient ways to address the global warming problem. In many cases, more
reliable and useful information can be learnt from datasets contributed by multiple
data holders. For example, Russia, Norway, the United States, Canada and Denmark
each own part of the Arctic. They each hold the regional temperature records of the
Arctic. The information uncovered from one country’s database cannot be used to
represent the actual situation of the whole Arctic. If all five countries were to share
their temperature records with one another and perform statistical hypothesis tests
collaboratively, the results may bring to light more about the actual situation in the
Arctic than results based on regional data alone. In other words, more
accurately analysed results require the use of more comprehensive datasets, and in
many cases, datasets are managed by different data holders. A distributed statistical
computation method that enables multiple parties to compute collaboratively on their
joint dataset would be an ideal solution.
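The hypothesis test in the Arctic example can be made concrete with the nonparametric sign test, the statistic this thesis later adopts as its case study. The sketch below uses hypothetical paired station readings and an exact two-sided binomial p-value; it assumes the records have already been pooled into one place, i.e. it provides no privacy protection yet.

```python
from math import comb

def sign_test(x, y):
    """Nonparametric sign test on paired samples: count positive
    differences (ties dropped) and compute an exact two-sided
    p-value under the null hypothesis P(x_i > y_i) = 1/2."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    k = sum(1 for d in diffs if d > 0)        # number of positive signs
    tail = min(k, n - k)
    p = min(1.0, 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n)
    return k, n, p

# hypothetical mean annual temperatures (degrees C) at eight Arctic
# stations, 1900s decade versus 1940s decade
t_1900s = [-18.2, -17.9, -19.1, -16.5, -20.0, -18.8, -17.2, -19.5]
t_1940s = [-17.6, -17.1, -18.4, -15.9, -19.2, -18.1, -16.8, -18.7]
k, n, p = sign_test(t_1940s, t_1900s)
print(k, n, p)   # all 8 differences positive: k = 8, n = 8, p = 0.0078125
```

With all eight differences positive, the null hypothesis of no temperature change would be rejected at the 5% level; a privacy-preserving version of exactly this computation is what the later chapters construct.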
1.2 Privacy Concerns in Distributed Statistical Computation
Distributed statistical computation, however, raises privacy concerns among data
holders (e.g. data providers) as well as data owners (i.e. data subjects) [CLIF’02b].
Privacy is the right of individuals to determine for themselves when, how and to what
extent information about them is communicated to others. Information in this context
concerns not only the raw data about an individual (e.g. name, age and sex), but also
their credentials (e.g. degree certificate and invalidity benefit entitlement) and their
policies (e.g. I never tell anyone my annual income). The data owners are data
subjects that submit their private data items to data providers while the data providers
are individuals or organizations that collect data and manage data repositories. In
other words, data providers are third parties managing and using data provided by
data owners. One of the key issues in protecting a data subject’s privacy by its data
holder is how to manage the identifiable information so that the data released for other
purposes does not violate the data subject’s wishes (i.e. privacy policies). The
European Community regulation [DIRE’95] and the U.S. HIPAA rules [OCR’03] are
regulations that govern data providers, i.e. data providers are held legally responsible
for making sure that data under their management are protected properly. In addition, data owners
also have privacy concerns regarding whether data providers would use their data for
purposes that are different from what their data are submitted for. These concerns, if
not addressed properly, will deter their willingness to submit their data to data
providers.
Concerted research efforts have been made to search for ways to achieve privacy-
preserving collaborative computation. These efforts can largely be divided into two
categories:
• Anonymising the data (i.e. removing all information that can directly link data
items to their owners) so that it cannot be traced to the identity of the data subject
[SAMA’98, SWEE’02a, WU’06]. There are two problems associated with this
approach. The first problem is that there is a fine balance between the
usefulness of the data and the degree of anonymisation. A high degree of
anonymisation resulting in a high level of privacy protection often means that
the resulting data is useless. Secondly, anonymising the data could make
re-identification of the data subject impossible. However, such re-identification is
necessary for applications such as health care: simply removing one’s
identifying information may not be sufficient to protect his/her privacy. There
are examples showing that even if the identity related information is removed
from the released data, combining the data with other available information
sources or knowledge may still expose the identity of the data subject
[CAST’10].
• Making use of cryptographic primitives and/or other data perturbation
techniques to preserve data privacy while supporting distributed computation.
This category is sometimes also referred to as secure multi-party computation
(SMC) [DU’01b, DU’01c, GOLD’04]. When a function is computed on a joint
dataset containing data inputs from multiple parties, cryptographic primitives
and/or other privacy-preserving techniques are employed to ensure that no
more information is revealed to a participating party other than the final
computational result. Examples include privacy-preserving data mining (PPDM)
and privacy-preserving distributed statistical computation (PPDSC). PPDM
focuses on transforming normal data mining computation algorithms to
privacy-preserving ones by using both cryptographic primitives and data
perturbation techniques [AGRA’00, AGRA’01, AGGA’08a]. PPDSC can be
further divided into two subcategories: statistical disclosure control (SDC) and
secure statistical computation (SSC). On the one hand, SDC methods are
mainly related to the data dissemination stage and are usually based on
restricting the amount of or modifying the data released [WILL’96, WILL’01].
They are normally applied to two types of data: Microdata and Tabular Data.
Microdata consists of individual statistical records relating to a single
statistical unit. Tabular data is the aggregated information on entities presented
in tables. On the other hand, SSC solutions provide privacy-preserving
algorithms to address privacy concerns when multiple parties perform
statistical computation on the joint dataset [LUO’03, DU’04, KARR’09b].
1.3 Research Motivation and Challenges
Early solutions to the privacy-preserving distributed data computation (PPDDC)
problems [GOLD’87, YAO’87, GOLD’98, GOLD’04] take a generic approach: a
computational problem is first described as a combinatorial circuit. The
participating parties then run a simple protocol for every gate of this circuit, and
these protocols are executed as the circuit is evaluated. Almost all PPDDC problems
can be solved using this approach, which is appealing because of its generality and
simplicity. However, it is neither efficient nor practical
for many research problems. As indicated by Goldreich et al. in [GOLD'98,
GOLD'04], applying solutions derived from general results to specific cases of SMC
problems can be impractical, and it is preferable that specific solutions are developed
for specific SMC problems in order to devise more efficient solutions suited to
different application contexts. Accordingly, later solutions have been mostly designed
to tackle specific data computation problems. To date little effort has been expended
on designing privacy-preserving solutions to support distributed statistical
computations [DU’01c]. As mentioned by Goldwasser regarding the importance of
this research field [GOLD’97]: “the field of multi-party computation is today where
public-key cryptography was ten years ago, namely, an extremely powerful tool and
rich theory whose real-life usage is at this time only beginning but will become in the
future an integral part of our computing reality”.
Motivated by these observations, this thesis focuses on the study, research and
development of an efficient and practical solution to the privacy-preserving
distributed statistical computation (PPDSC) problem. The two-party nonparametric
sign test computation (P22NST) problem is chosen as a case for study. Two novel
protocol suites have been developed, namely, the privacy-preserving two-party sign
test protocol suite using data perturbation techniques (P22NSTP) and the privacy-
preserving two-party sign test protocol suite using cryptographic primitives
(P22NSTC). In designing these protocols, the following challenges have been
addressed:
The Definition of Data Privacy
A specific computation problem involves a specific type of dataset, which is
associated with a particular type of data privacy. Prior to investigating and
designing solutions to the P22NST problem, it is necessary to clarify the
definition of data privacy in the PPDSC context.
The Identification of Potential Security Threats in the Context of the
P22NST Computation
It is critical that a secure P22NST protocol should not only protect data privacy
but also allow the computation to be carried out. In other words, it should
protect data privacy against security threats throughout the entirety of the
computation. For this purpose, the sign test computation algorithm under an ideal
trusted third party model (TTP-NST) is decomposed, and threats to data
privacy at each step of the computation have been analysed and identified.
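For reference, the plain (non-private) paired sign test that the TTP-NST decomposition starts from can be sketched as follows. This is the textbook exact sign test; the function name and example values are illustrative, not taken from the thesis.

```python
from math import comb

def sign_test(x, y):
    """Plain (non-private) paired sign test: count positive
    differences (ties dropped) and compute an exact two-sided
    binomial p-value under H0: P(X > Y) = 0.5."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n, k = len(diffs), sum(d > 0 for d in diffs)
    tail = min(k, n - k)                           # smaller sign count
    p = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return k, n, min(1.0, 2 * p)                   # two-sided p-value

# four positive signs out of five non-tied pairs
k, n, p = sign_test([1, 2, 3, 4, 5], [0, 0, 0, 0, 10])
```

In a distributed setting, every intermediate quantity here (the signs of the differences, the count k) is a potential leak, which is what the decomposition analysis identifies.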
The Identification of Appropriate Privacy-preserving Primitives to
Counter Threats to Data Privacy in the P22NST Computation
There are different privacy-preserving primitives, each with varying levels of
effectiveness and efficiency. Work has been carried out to identify and
critically analyse various privacy-preserving primitives with the aim of
identifying the most appropriate ones to support our purpose. Two types of
techniques are considered, namely data perturbation techniques and
cryptographic primitives. Often there is a trade-off between costs and the
effectiveness of a primitive in protecting data privacy. Acceptable trade-offs
may need to be made between privacy protection and performance
(computational and communicational) efficiency in designing a P22NST
solution in order to achieve a cost-effective and practical solution to the
P22NST problem.
The Design of Efficient and Practical Protocol Suites to the P22NST
Problem
Two computation models have been investigated, namely the two-party model
and the Semi-trusted Third Party (STTP) model. Based on these computation
models and identified privacy-preserving primitives, two protocol suites have
been designed to support the P22NST computation. In these designs, two data
partitioning models are considered: one is the vertically partitioned
(heterogeneous) model and the other is the horizontally partitioned
(homogeneous) model. Data perturbation techniques and cryptographic
primitives have been used in these protocol suite designs.
The Prototype of the Two Protocol Suites Using MATLAB
The two designed protocol suites, i.e. the P22NSTP and the P22NSTC protocol
suites, have been prototyped and implemented using MATLAB software. A
set of experiments have been planned and carried out. The performances of the
two protocol suites have been evaluated using these experimental results. A
comparison of these two protocol suites against the TTP-NST model has
also been made to identify the features of each protocol suite.
The Development of A Systematic Methodology for the PPDSC
Computation
Through the process of designing solutions to the P22NST problem, a
systematic methodology has been developed in search of an efficient and
practical solution. This approach incorporates privacy definition, statistical
algorithm decomposition, solution design and solution implementation. It has
been demonstrated that this methodology can be used to design solutions to
the P22NST problem. It can also be used to address other PPDSC problems.
1.4 Research Aim and Objectives
The aim of this research is to search for an optimum solution to the PPDSC problems.
For this purpose, the P22NST computation is chosen as a case for study. The
methodology is demonstrated through the process of designing efficient and practical
solutions to the P22NST problem. That is, to design, prototype and evaluate solutions
that take into account not only data privacy but also other attributes, such as
computational and communication overheads, execution time and the trade-off
between efficiency and privacy. To accomplish this aim, the following objectives
have been fulfilled:
1. To study the literature: to critically analyse related works in the topic area and
to identify research gaps to be addressed in this thesis.
2. To specify requirements for preserving data privacy while supporting the
distributed computations. The requirements are aimed at minimizing any
privacy disclosure from data inputs while optimising performance.
3. To investigate and critically analyse state-of-the-art privacy-preserving
primitives and building blocks in PPDDC and identify those that best satisfy
the requirements identified in 2.
4. To design the P22NST protocol suites based on the primitives and building
blocks identified in 3. The design should take into account different data
partitioning models and computational approaches.
5. To prototype the designed protocol suites.
6. To informally analyse the security strength of the designed protocol suites
against design requirements.
7. To evaluate the privacy levels, computational and communication overheads
and execution times of the designed protocol suites.
8. To publish research results and write up the PhD thesis.
1.5 Research Method
The four major tasks of this research together with their respective methods of
execution are described below.
Task 1: Literature Review
The first task was to study the area of secure distributed computation. The focus at
this stage was to critically analyse related work in this field and to identify research
gaps or areas where a novel contribution could be made. At the end of the literature
review stage, the PPDSC problem was pinpointed as a topic for future work. More
specifically, the privacy-preserving distributed nonparametric sign test (P2DNST)
computation problem was selected as the case for study. Then further literature study
was carried out in the search for an effective and efficient approach to the P2DNST
problem. During the research it became apparent that a new definition of privacy was
needed. This led to a three-level definition of data privacy: individual
data confidentiality, individual privacy and corporate privacy.
Task 2: Theoretical Work
Following the definition of data privacy, the algorithm of nonparametric sign test
(NST) was analysed in a distributed computational context. More specifically, it was
analysed under three different settings: trusted third-party (TTP)-based, semi-trusted
third-party (STTP)-based and two-party/multi-party-based, so as to identify potential
ways of compromising privacy in the computation process. These analyses were
carried out under the assumption of the vertically partitioned data model and the
horizontally partitioned data model. Such analyses led to the decomposition of the
TTP-NST algorithm and the identification of local and global computational tasks and
message exchanges needed to accomplish the global computational task. Then, a set
of design requirements were specified to govern the conversion of NST into
privacy-preserving NST. Based on the design requirements, more literature research was
carried out investigating the building blocks that could be used to preserve privacy in
distributed sign test computation. This in turn led to the identification of a set of
building blocks satisfying our design requirements, which were used to secure the
input and output of local computations. At the end of this stage, two novel protocol
suites were designed: the Privacy-preserving Two-party Nonparametric Sign Test
Using Data Perturbation Techniques protocol suite (P22NSTP) and the Privacy-
preserving Two-party Nonparametric Sign Test Using Data Cryptographic
Primitives protocol suite (P22NSTC). The review and study of relevant literature were
continued throughout the whole research period, thus enabling the repeated
refinement of ideas by taking merits from existing works. As the designs of the two
protocol suites were published [LIU’10, LIU’11a, LIU’11b], the comments from
referees were also taken into account to further refine this work.
Task 3: Protocol Implementation
On completion of the theoretical work stage, the next step was to implement and
simulate the proposed protocol suites. The implementation and simulation were
carried out using MATLAB software. Before implementation, it was necessary to
become familiar with the features of this software in order to use it to its full
potential. Following this, the protocol suites were implemented. To obtain a correct
simulation result, it was necessary to validate the implementation to ensure the
practical outcome was in line with the theoretical design. This was done by simulation
validation. The validation was performed by deriving a mathematical model of a
simplified simulation model and comparing the simulated results with the results
obtained from the mathematical model. Once the validation was completed, the
simulation model could be run with full confidence. The simulation was run in a set of
scenarios and settings to show that the models are practical and applicable to various
computational scenarios and different data and privacy requirements. The simulation
results were recorded and used to analyse the performance of the protocol suites.
Task 4: Evaluation
To evaluate the performance of the protocol suites, results from the simulation were
plotted as graphs, which in turn were used to compare the performances of the
proposed protocol suites with that of the TTP-NST model.
Conclusions of this research were then drawn from this evaluation. A direction for
future work was also identified from the evaluation.
1.6 Novel Contributions
This research has developed a systematic methodology in the search for an optimum
solution to the PPDSC problems. More specifically, it has designed two protocol
suites to address the P22NST computation problem. To the best of the author’s
knowledge, this is the first work in this area to address privacy-preserving statistical
hypothesis testing (P2SHT) problems in a distributed setting. This research work has
made three significant contributions to the knowledge area. The first contribution is
the design, analysis and evaluation of a two-party based protocol suite to allow two
parties to perform nonparametric sign testing with privacy preservation, i.e. the
P22NSTP protocol suite. The P22NSTP protocol suite makes use of five novel data
disguising protocols to assist the computation. The second contribution is the design,
analysis and evaluation of an on-line STTP based protocol suite to support privacy-
preserving nonparametric sign test computation, i.e. the P22NSTC protocol suite. This
protocol suite makes use of three cryptographic primitives and a STTP, which only
plays an assistant role in facilitating the computation. The P22NSTC protocol suite is
more secure but less efficient than the P22NSTP protocol suite as the former employs
cryptographic primitives. Both protocol suites have been implemented and evaluated
using MATLAB. The third contribution is the development of a four-phase
methodology to transform a normal statistical algorithm to a privacy-preserving
distributed one. The four phases include (1) data privacy definition, (2) statistical
algorithm decomposition, (3) solution design and (4) solution implementation. The
detailed novel contributions are further described below.
Design, Analysis and Evaluation of the P22NSTP Protocol Suite
1. Dataset properties are firstly examined in both vertically partitioned and
horizontally partitioned data models. This is necessary in order to identify
potential security threats in the algorithm decomposition stage and to identify
privacy-preserving building blocks in the design stage.
2. Security threats are identified upon the decomposition of the TTP-NST
algorithm in the two-party computation model. The algorithm decomposition
divides the TTP-NST algorithm into local and global computational tasks. By
examining the security threats in every computational task, design
requirements are specified.
3. Based on the design requirements, appropriate data perturbation techniques are
chosen and used to design the components of the P22NSTP protocol suite.
These techniques help to protect data privacy while supporting the execution
of each computation task. The components of the P22NSTP protocol suite, i.e.
five novel data perturbation protocols, are designed to address security threats
with efficiency considerations. The first protocol is used by both parties to
disguise their respective datasets before they perform the joint computation.
The second to the fifth protocols are each used to accomplish a local
computational task in the joint computation while protecting the
confidentiality of intermediate computation results. These five protocols are:
Random Probability Density Function Generation Protocol (RpdfGP). This
protocol allows a participating party to randomly generate a Probability
Density Function based on their original dataset.
Data Obscuring Protocol (DOP). This protocol protects data confidentiality
of an original dataset using data perturbation techniques.
Secure Two-party Comparison Protocol (STCP). This protocol allows one
of the participating parties to privately compare the datasets without
learning the exact values of the other party's dataset or the comparison
result.
Data Extraction Protocol (DEP). This protocol reverses the results of data
disguising operations performed by DOP.
Permutation Reverse Protocol (PRP). This protocol reverses the results of
data disguising operations performed by STCP.
The design requirements are addressed by the five protocols mentioned above,
which make use of data perturbation techniques, namely data randomization,
data swapping and data transformation. As data perturbation techniques are
generally more computationally efficient than cryptographic primitives, this
protocol suite provides better efficiency than the second protocol suite, i.e. the
P22NSTC protocol suite.
4. The P22NSTP protocol suite is implemented using MATLAB.
5. Both security analysis and performance evaluation of the P22NSTP protocol
suite are carried out. The security analysis is performed against the privacy
requirements. The performance evaluation is carried out in terms of
computational costs, communication costs and execution time. These costs are
evaluated through the simulation of the protocol suite implementation. The
results collected from the simulation are compared against theoretical results
from a TTP-NST model. The effects of different data sizes and the degree of
noise addition on the performance of the protocol suite are also evaluated.
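The data perturbation techniques underpinning the P22NSTP protocols — data randomization, data swapping and data transformation — can be illustrated with a minimal sketch. The function name, noise model and parameters below are assumed for illustration only; they are not the five protocols themselves.

```python
import random

def disguise(data, noise_scale=1.0, seed=None):
    """Illustrative data-disguising step combining the three
    perturbation techniques named in the text: a simple linear
    transformation (scaling), additive randomization (noise) and
    data swapping (random permutation)."""
    rng = random.Random(seed)
    scale = rng.uniform(1.0, 2.0)                     # transformation
    noisy = [scale * v + rng.uniform(-noise_scale, noise_scale)
             for v in data]                           # randomization
    rng.shuffle(noisy)                                # swapping
    return noisy, scale

disguised, scale = disguise([12.5, 7.1, 9.9], noise_scale=0.5, seed=1)
```

The appeal of such operations is that they cost only a few arithmetic operations per value, which is why perturbation-based suites tend to outperform cryptographic ones.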
Design, Analysis and Evaluation of the P22NSTC Protocol Suite
1. An on-line STTP is employed in the design of this protocol suite in order to
provide a higher level of security than the P22NSTP.
2. This protocol suite consists of two novel protocols which are designed to
address security threats in the P22NST computation. They are:
Dataset Split Protocol (DSP): This protocol allows the STTP to randomly
separate an encrypted dataset G into two subsets G1 and G2, where
G = G1 ∪ G2 and G1 ∩ G2 = ∅.
Dataset Randomization Protocol (DRP): This protocol allows the STTP to
randomly generate a randomized dataset G' based on the encrypted G, where
G ⊆ G'.
The design of this protocol suite makes use of both data perturbation techniques,
namely data randomization, data swapping and data transformation, and a
cryptographic primitive, namely an additively homomorphic encryption scheme.
As cryptographic primitives are more secure but also more computationally expensive
than data perturbation techniques, the P22NSTC provides an alternative solution
to the P22NSTP. As the datasets are encrypted before being sent to the STTP,
this protocol suite enables the STTP to compute the data inputs without
decrypting them.
3. The P22NSTC suite is implemented using MATLAB.
4. A security analysis and performance evaluation of the P22NSTC protocol suite
are carried out and the results collected from the simulation are further
compared against the results from the TTP-NST model and the P22NSTP
protocol suite. The effects of different data sizes and the degree of noise
addition on the performance of the protocols, in terms of computational and
communication costs and protocol execution time, are evaluated.
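The additively homomorphic property that lets the STTP compute on encrypted inputs without decrypting them can be illustrated with a toy Paillier cryptosystem. The parameters below are tiny and insecure, chosen purely to show the property E(a)·E(b) mod n² = E(a+b); this is an illustrative stand-in, not the scheme specified in the thesis.

```python
import random
from math import gcd

# Toy Paillier setup (never use such small primes in practice)
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def enc(m):
    """Encrypt m < n with fresh randomness r coprime to n."""
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def dec(c):
    """Decrypt using the private values lam and mu."""
    return (pow(c, lam, n2) - 1) // n * mu % n

# multiplying ciphertexts adds the underlying plaintexts
assert dec(enc(7) * enc(35) % n2) == 42
```

An STTP holding only ciphertexts can therefore aggregate the parties' inputs, while only the key holders can recover the result.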
Development of the Four-phase Methodology to Transform A Normal
Statistical Algorithm to A Privacy-preserving Distributed One
1. According to the demonstrations of the P22NSTP and the P22NSTC, a
four-phase methodology can be summarized. The four
phases are: (1) privacy definition, (2) statistical algorithm decomposition, (3)
solution design and (4) solution implementation.
2. The privacy definition phase identifies and defines the crucial information
that needs to be protected in a dataset involved in a distributed statistical
computation problem.
3. The algorithm decomposition phase decomposes the original statistical
computation into a set of local and global computational tasks so as to identify
every potential threat to data privacy in the tasks.
4. In the solution design phase, the most appropriate privacy-preserving building
blocks are selected and used in the designs of the solution so as to achieve a
privacy-preserving solution to the distributed statistical computation problem.
5. Finally, the solution implementation phase implements and evaluates the
designed solution so as to ensure the outcome is in line with the theoretical
design.
Publications
Conference Publications
1. M.C. Liu and N. Zhang, (2010), “A Solution to Privacy-preserving Two-party
Sign Test on Vertically Partitioned Data (P22NSTv) Using Data Disguising
Techniques”, the Proceedings of the International Conference on Networking
and Information Technology (ICNIT 2010), pages 526-534, Manila,
Philippines, 11-12 June 2010, IEEE Computer Society Press.
2. M.C. Liu and N. Zhang, (2011), “A Cryptographic Solution to Privacy-
preserving Two-party Sign Test Computation on Vertically Partitioned Data”,
the Proceedings of the 2nd International Conference on Electronics and
Information Engineering (ICEIE 2011), Tianjin, China, 9-11 September 2011.
Journal Publication
3. M.C. Liu and N. Zhang, (2012), “A Cryptographic Solution to Privacy-
preserving Two-party Sign Test Computation on Vertically Partitioned Data”,
Advanced Materials Research, Volumes 403-408, pages 1249-1257, Trans
Tech Publications, Switzerland, doi: 10.4028/www.scientific.net/AMR.403-408.1249.
1.7 Thesis Structure
This thesis is organized as follows. Chapter 2 gives the background information and
critical analysis of related works in the field of PPDDC. The focus is on SMC, PPDM
and PPDSC. Chapter 3 details the design preliminaries and evaluation methods used
for the design and evaluation of the novel protocol suites presented in this thesis.
Chapter 4 presents a list of privacy-preserving building blocks that could be used to
achieve PPDDC. Chapter 5 describes the first novel protocol suite, the P22NSTP,
along with its theoretical privacy analysis and performance evaluation, while the
second novel protocol suite, the P22NSTC, which employs more secure cryptographic
primitives and an on-line STTP, is described in Chapter 6. Chapter 7 presents
simulation studies of the two protocol suites, the P22NSTP and the P22NSTC, and
further evaluates their performances. The performance results are compared with
those from the TTP-NST model. Finally, Chapter 8 concludes this thesis and
highlights future work.
Chapter 2 Literature Survey: Privacy-preserving
Distributed Data Computation
2.1 Chapter Introduction
This chapter presents a literature survey of the related works in the domain of
Privacy-preserving Distributed Data Computation (PPDDC) and is organised as
follows. Section 2.2 offers terminologies and definitions that are used in the domain
of PPDDC. Section 2.3 critically analyses the related works in literature, which have
been categorised into three research areas, namely, secure multi-party computation
(SMC), privacy-preserving data mining (PPDM) and privacy-preserving distributed
statistical computation (PPDSC). Section 2.4 identifies gaps in existing solutions and
suggests areas for future work; finally, the chapter is summarized in Section 2.5.
2.2 Terminologies and Definitions
The basic concepts, terminologies and definitions introduced in this section are all
used in the domain of PPDDC. The concepts include computation models, data
partitioning models, adversarial models and privacy definitions which are all
described here.
2.2.1 Computation Models
Different computation models raise correspondingly different security concerns.
For example, in a computation supported by a third party (TP), it is
sometimes necessary to keep confidential data from the TP, while in a fully distributed
computation scenario this concern does not exist. Based on the literature survey,
three computation models have been employed in the related works. They are the
trusted-third party (TTP) model [GOLD’98, GOLD’01, GOLD’04], the semi-trusted
third party (STTP) model [ABAD’02, CACH’99, EMEK’06, FEIG’94, FRAN’97,
KANT’02, KANT’03, BEAV’97, BEAV’98, CACH’00, DU’01b, DU’04, NAOR’99]
and the two-party/multi-party distributed computational model [CAST’04, DOLE’91,
DU’01b, DU’04, JAGA’06, ZHAO’05] (i.e. a distributed computation is carried out
without any assistance from a third party). For the sake of convenience and clarity, in
this section, the two participating parties will be known as Alice and Bob in the
descriptive examples where Alice holds dataset X and Bob holds dataset Y .
Trusted-third Party (TTP)
The TTP Model (also known as the Ideal Model in the literature) assumes that there
exists a TTP to whom all the participants could surrender their data [GOLD’98,
GOLD’01, GOLD’04]. By using the data from the participating parties, the TTP
performs the computation and then delivers the final computation result to the parties.
As the TTP is assumed to be fully trustworthy, and will not reveal any data other than
the final computation result to the parties, no party can learn anything which is not
inferable from its own input and the final computational result, thus achieving
privacy-preserving computation. This model provides an efficient way to support
collaborative computation while preserving the privacy of the parties’ respective data.
Figure 1 illustrates this two-party collaborative computation using the TTP-based
model.
Figure 1. The trusted third party (TTP) computation model. (Source: Author’s own)
In this illustration, Alice sends her own private dataset X to the TTP and Bob sends
his own private dataset Y to the TTP. After receiving X and Y, the TTP performs
the computation f(X, Y) and generates the computation result R_{X,Y}. Finally, the TTP
sends R_{X,Y} to Alice and Bob, respectively. Provided that the communication channels
between the TTP and Alice (or Bob) are secure, no external entities can gain access
to data sent over these channels. In addition, the TTP has to be fully trustworthy. At
the end of this computation, all Alice knows is her own private dataset X and the
final computation result R_{X,Y}; all Bob knows is his own private dataset Y and the
final computation result R_{X,Y}.
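As a minimal illustration of this ideal-model flow: the parties surrender raw inputs and receive only f(X, Y). The function f, the inputs and all names below are illustrative, with f chosen to count positive paired differences, the core quantity of a sign test.

```python
def ttp_compute(f, x, y):
    """The TTP sees both raw datasets and returns only the result;
    the privacy guarantee rests entirely on trusting the TTP."""
    return f(x, y)

# illustrative f: count pairs where Alice's value exceeds Bob's
f = lambda x, y: sum(a > b for a, b in zip(x, y))
result = ttp_compute(f, [3, 1, 4], [2, 2, 2])
assert result == 2   # all either party learns beyond its own input
```

The limitations discussed next follow directly from this shape: the TTP handles all raw data, making it both a trust bottleneck and a single point of failure.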
However, this model suffers from a number of limitations. Firstly, the TTP must be
fully and unconditionally trustworthy, and must not collude with any of the
participating parties [GOLD’98, GOLD’01, GOLD’04]. If any of these conditions is
violated, or should the TTP be compromised, the privacy-preserving property will be
lost. In real life, it may be impractical to always assume the existence of a party that
can be trusted unconditionally by all the participating parties. For example, would an
organization in one country always trust, and/or be willing to surrender its data to an
organization in another country? Secondly, due to legal and ethical concerns
[CLIF’04, EURO’95, OCR’03, MUSE’08], data providers managing private and
sensitive personal information are under legal obligations not to release raw data to an
external party even if this party is fully trustworthy. Furthermore, it is not scalable to
rely on the use of a single party to perform the computation and to achieve privacy.
As computation loads increase, the TTP becomes more prone to being a performance bottleneck.
More importantly, it is also a single point of failure. If the TTP breaks down, the
computation will no longer be possible. For these reasons, this model may not be
appropriate in some application contexts.
Semi-trusted Third Party (STTP)
An alternative model to the TTP model is to employ a STTP to assist the computation
[ABAD’02, CACH’99, EMEK’06, FEIG’94, FRAN’97, KANT’02, KANT’03,
BEAV’97, BEAV’98, CACH’00, DICR’98, DU’01b, DU’04, NAOR’99]. In this
model, the third party is not expected to be heavily involved in the computation, so the
level of trust placed in the third party is reduced. It is assumed that the third party does not
have access to private data contributed by Alice or Bob, nor does it access
computational results. It is also assumed that the third party does not collude with any
of the participants and that it will execute the protocol correctly. For these reasons, the
third party is called STTP. Depending on the way the STTP assists the computation,
this model can be further classified into two categories: (1) the Off-line STTP based
computation model [BEAV’97, BEAV’98, CACH’00, DICR’98, DU’01b, DU’04,
NAOR’99] and (2) the On-line STTP based computation model [ABAD’02,
CACH’99, EMEK’06, FEIG’94, FRAN’97, KANT’02, KANT’03].
Using an Off-line STTP
In the Off-line STTP based computation model, the STTP does not actually perform
any computational task. It does not have access to input data from any of the
participating parties; rather, it acts as a commodity server, a program issuer or a
fairness checker. When the STTP acts as a commodity server it provides random
noise values to the parties as commodities [BEAV’97, BEAV’98, DICR’98, DU’01b,
DU’04]. The parties then use these random values to disguise their private data items
before sending them to the other parties. Basically, the STTP can generate the random
values off-line independently, and sells them to the parties as commodities. This
commodity server model was first proposed by Beaver in [BEAV’97, BEAV’98] and
then widely used in addressing various SMC problems [DICR’98, DU’01b, DU’04].
Figure 2 further illustrates the operation of this commodity server STTP model.
Figure 2. An example of the commodity server model. (Source: Author’s own)
As shown in the green square of the figure, the commodity server first generates random
noise V and W off-line and sends them to Alice and Bob, respectively. Alice and
Bob then use the random noise to disguise their private datasets prior to the joint
computation, as shown in the blue square of the figure. The commodity server does
not participate in the joint computation.
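A minimal sketch of the commodity idea for a private two-party sum follows. The server generates correlated noise V + W = 0 off-line; each disguised value alone reveals nothing about the input it masks, yet the noise cancels in the sum. The names and the choice of a sum as the joint computation are illustrative, not the thesis protocols.

```python
import random

def commodity_pair(rng):
    """Commodity server step: generate correlated noise off-line,
    with V + W = 0, and sell one share to each party."""
    v = rng.randrange(-10**6, 10**6)
    return v, -v

rng = random.Random(7)
v, w = commodity_pair(rng)           # produced before any computation
x, y = 123, 456                      # Alice's and Bob's private inputs
# each party publishes only its disguised value
assert (x + v) + (y + w) == x + y    # noise cancels in the joint sum
```

Because the noise is independent of the inputs, the server can generate commodities in bulk ahead of time, which is exactly what makes it "off-line".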
The second type of the off-line STTP model, i.e. the program issuer model, is widely
applied in the field of private auction and bidding [DICR’98]. In this context, the
STTP acts as an auction issuer and is responsible for preparing a program that can
privately compute data inputs from bidders and generate an output for an auctioneer.
As the program preparation can be done in advance and independently from the
auctioneers, the issuer needs to interact with neither the auctioneers nor the bidders.
Therefore, the auction issuer is able to provide multiple auction programs to multiple
auctioneers at the same time. Figure 3 illustrates the operation of this program issuer
STTP model.
Figure 3. An example of the program issuer model. (Source: Author’s own)
As shown in the figure, the server first generates a program off-line then sends it to
the auctioneer, who then uses this program to start an auction process. The interested
bidders submit their private bids to the auctioneer. The issuer is not involved in the
auction process.
In the third type of off-line STTP model, i.e. the fairness checker [CACH’00], the
STTP is used to ensure fairness for a secure computation. It does not participate in the
computation if the participating parties are honest and if messages are delivered error-
free. This model is widely used in fair message exchange and digital contract
signing applications. Figure 4 illustrates the operation of the fairness checker STTP
model.
Figure 4. An example of fairness checker STTP model. (Source: Author’s own)
In this figure, Alice and Bob first perform a joint computation, on the completion of
which Alice has a result V and Bob has a result W . Alice and Bob then send V and
W to the verifier, respectively. The verifier compares V and W , and then returns the
comparison result to Alice and Bob, respectively. In this model, the verifier does not
actively take part in the computation; it only waits for the computation results from
Alice and Bob, and compares the results to make sure they are the same. This
comparison operation can be performed off-line.
Using an On-line STTP
On the other hand, in the On-line STTP based computation model [ABAD’02,
CACH’99, EMEK’06, FEIG’94, FRAN’97, KANT’02, KANT’03], the STTP actually
takes part in the computation. Unlike the TTP based and the off-line STTP models,
here Alice and Bob need to disguise (perturb or encrypt) their respective data, X and
Y , by employing privacy-preserving techniques (data perturbation techniques or
cryptographic primitives) before sending them to the STTP [ABAD’02, CACH’99,
EMEK’06, FEIG’94, FRAN’97, KANT’02, KANT’03]. This approach typically only
requires two message transactions per party: one for sending their encrypted data to
the STTP, and the other for fetching the computational result from the STTP once the
computation is completed. However, when encryption schemes are used, this model is computationally more expensive than the commodity server STTP model. Therefore, cryptographic primitives are not preferable when Alice and Bob are computationally weak and/or protocol efficiency is a major requirement.
Figure 5 illustrates the operations of this model which employs a symmetric
cryptosystem.
Figure 5. An example of the on-line STTP model. (Source: Author’s own)
In this example, Alice and Bob first negotiate a symmetric encryption scheme (k, E_k(·), D_k(·)) prior to the computation, where k is the encryption key, E_k(·) is the encryption function and D_k(·) is the decryption function. Then Alice encrypts X and sends E_X to the STTP; Bob encrypts Y and sends E_Y to the STTP. After receiving E_X and E_Y, the STTP computes E_{R(X,Y)} and sends it to Alice and Bob, respectively. Finally, Alice and Bob each decrypt E_{R(X,Y)} and obtain R_{X,Y}. As (k, E_k(·), D_k(·)) is known only to Alice and Bob, the STTP is unable to decrypt E_X and E_Y. The privacy of X and Y is thus kept from the STTP.
Two-party / Multi-party Model
As mentioned earlier in this section, in real life, it may be difficult to find a third party
that could be trusted by all the participating parties. Under such circumstances, the
computation can only be carried out by the participating parties themselves. This
leads to the so called multi-party model. The two-party model is the simplest form of
the multi-party model [CAST’04, DOLE’91, DU’01b, DU’04, JAGA’06, ZHAO’05].
In this model, a computation is divided into local and global computational tasks. On
the one hand, a local computational task is one that can be performed by one party
independently, i.e. without further interaction with the other party. On the other hand,
a global computational task requires the two parties to interact, to exchange data and
to compute collaboratively. An output of a local computational task may be the input
of a global computational task; an output of a global computational task may be the
input of a local computational task or another global computational task. As these
intermediate computational results contain aggregated information of an individual
dataset (i.e. X or Y) or the joint dataset (i.e. X ∪ Y), their privacy should be
preserved. Such privacy is protected by data perturbation techniques or cryptography
primitives in the literature. Figure 6 illustrates an example of the privacy-preserving
two-party computation model.
Figure 6. An example of the two-party computation model. (Source: Author’s own)
In this example, Alice holds a dataset X and Bob holds a dataset Y, where X = {x_1, ..., x_n} and Y = {y_1, ..., y_n}. They want to know the mean value of the joint dataset X ∪ Y while Alice is kept from knowing Y and Bob from knowing X. For this purpose, Alice first generates a random dataset V, where V = {v_1, ..., v_n}. Alice then computes X' = X + V and sends X' to Bob. After receiving X', Bob computes the mean value of X' and Y, i.e. r_{X',Y}, and then sends r_{X',Y} to Alice. Finally, Alice computes r_{X,Y} = r_{X',Y} − (Σ_{i=1}^{n} v_i)/(2n) and then sends r_{X,Y} to Bob. At the end of the computation, neither Alice nor Bob knows the exact data inputs of the other party’s dataset.
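The masking arithmetic of this example can be sketched as a single-process toy in Python; the names are illustrative and the “messages” are just local variables, so this shows the computation rather than a real networked protocol.

```python
import random

def two_party_mean(X, Y, seed=0):
    """Toy sketch of the additive-masking mean protocol described above."""
    rng = random.Random(seed)
    n = len(X)
    # Alice: generate a random mask V and "send" X' = X + V to Bob.
    V = [rng.uniform(-100.0, 100.0) for _ in range(n)]
    X_masked = [x + v for x, v in zip(X, V)]
    # Bob: compute the mean of X' and Y, i.e. r_{X',Y}, and return it to Alice.
    r_masked = (sum(X_masked) + sum(Y)) / (2 * n)
    # Alice: subtract the mask's contribution to recover r_{X,Y}.
    return r_masked - sum(V) / (2 * n)

# The joint mean of {1,2,3} and {4,5,6} is 3.5 (up to floating-point error).
print(two_party_mean([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Note that Bob only ever sees the masked values X', so he learns nothing about the individual x_i beyond what the final mean reveals.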
In real life scenarios, a computation may be conducted by more than two parties, i.e.
the so called multi-party computational model. Through the use of computation
decomposition, a multi-party computational task can always be decomposed into a
number of two-party computational tasks. The whole computation can then be seen as
a combination of a series of two-party computational tasks. Therefore, a solution to a
two-party computational problem can easily be extended into a solution to address a
multi-party computational problem. Figure 7 illustrates an example of decomposing a
multi-party computational task into a number of two-party computational tasks.
Figure 7. An example of an n-party computation. (Source: Author’s own)
In this example, n parties want to securely compute the sum of their private data inputs, but none of them is willing to share its private data input with any of the other parties. Prior to the computation, they have agreed to do this computation using the following two-phase method. In the first phase, party 1 calculates R_1 = x_1 + r_1 by adding a random noise r_1 to its private input x_1, then sends R_1 to party 2. Then, for i = 2, ..., (n−1), party i computes R_i = R_{i−1} + x_i + r_i and sends R_i to party i+1. After receiving R_{n−1}, party n calculates R_n = R_{n−1} + x_n + r_n and sends R_n to party 1. In the second phase, party 1 takes away r_1 from R_n, i.e. calculating S_1 = R_n − r_1, and sends S_1 to party 2. Then, for i = 2, ..., (n−1), party i takes away r_i from S_{i−1} by calculating S_i = S_{i−1} − r_i and sends S_i to party i+1. Finally, party n calculates S_n = S_{n−1} − r_n, where S_n equals Σ_{i=1}^{n} x_i. Party n then sends S_n to all other parties. By the end of this computation, each party knows only its own private data input and the final computation result S_n, provided that the parties do not collude with one another. Figure 7 further shows that this multi-party computation can be decomposed into a number of two-party computation tasks. In this figure, each blue rectangle represents a two-party computation task. This n-party computation can be decomposed into (2n − 1) two-party computations.
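The two phases of the ring protocol can be sketched as a single-process toy in Python, where each loop iteration stands for one party’s local step and the value it forwards to the next party (all names are illustrative):

```python
import random

def ring_sum(inputs, seed=0):
    """Two-phase ring sum: mask with private noise, then strip it off."""
    rng = random.Random(seed)
    noise = [rng.randint(1, 10**6) for _ in inputs]
    # Phase 1: party i adds x_i + r_i and forwards R_i to party i+1;
    # party n finally returns R_n to party 1.
    R = 0
    for x, r in zip(inputs, noise):
        R += x + r
    # Phase 2: party i removes its own noise r_i and forwards S_i.
    S = R
    for r in noise:
        S -= r
    return S  # S_n equals the sum of all private inputs

print(ring_sum([3, 5, 7, 9]))  # prints 24
```

Because each intermediate value R_i carries the accumulated noise of all preceding parties, no single recipient can extract another party’s input from the value it receives.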
2.2.2 Data Partitioning Models
In a distributed computation environment, data inputs are contributed from multiple
independent participants (data holders). The joint dataset contributed from different
participants may form different data models, i.e. a vertically partitioned data model
[AGGA’08, DU’01b, DU’01d, DU’04, KARR’09, LIU’10, REIT’04, VAID’02,
VAID’04, VAID’06], a horizontally partitioned data model [AGGA’08, DU’01b,
DU’01d, DU’04, VAID’06] and a hybrid data model [REIT’04]. These data models
play an important role in the design of privacy-preserving distributed computation
algorithms. The properties of these data models are described below.
Vertically Partitioned Data (Heterogeneous Data) Model
In the vertically partitioned data model, all participating parties hold data of the same
group of data subjects (i.e. data owners), but with different data attributes (i.e.
variables) [AGGA’08, DU’01b, DU’01d, DU’04, KARR’09, LIU’10, REIT’04,
VAID’02, VAID’04, VAID’06]. For example, if the joint dataset consists of the data
attributes of weight, height, blood pressure, job, income and tax, then, in the two-
party vertically partitioned data model, participant Alice may hold values of the
attributes of weight, height and blood pressure, while Bob holds the values of the
attributes of job, income and tax. In other words, a vertically partitioned data model
assumes that multiple parties have data collected from the same set of subjects but
each party only possesses data on different sets of attributes. This data model is also
referred to as a heterogeneous data model in the literature. Figure 8 illustrates a
vertically partitioned data model in a two-party computation.
Figure 8. An example of a vertically partitioned data model in the two-party computation. (Source: Author’s own)
Horizontally Partitioned Data (Homogeneous Data) Model
In the horizontally partitioned data model, all participating parties hold data of the
same set of attributes, but collected from different sets of data subjects [AGGA’08,
DU’01b, DU’01d, DU’04, VAID’06]. For example, in a two-party horizontally
partitioned data model, Alice holds group A ’s records while Bob holds group B ’s
records, where both sets of records contain data in relation to attributes of weight,
height, blood pressure, blood type, age and race. Group A and group B are disjoint, i.e. A ∩ B = ∅, where ∅ denotes the empty set. This model is also referred to as the
homogeneous data model in the literature owing to the fact that different participating
parties manage data of identical attributes. Figure 9 illustrates a horizontally
partitioned data model in a two-party computation.
Figure 9. An example of a horizontally partitioned data model in the two-party computation. (Source: Author’s own)
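The two partitioning models can also be illustrated with toy record sets (all identifiers and values below are made up): a vertical partition shares the subjects but splits the attributes, while a horizontal partition shares the attribute schema but splits the subjects.

```python
# Vertically partitioned: same subjects, disjoint attribute sets.
alice_v = {"id1": {"weight": 70, "height": 175, "bp": 120},
           "id2": {"weight": 62, "height": 160, "bp": 110}}
bob_v   = {"id1": {"job": "nurse",  "income": 30000, "tax": 4000},
           "id2": {"job": "driver", "income": 25000, "tax": 3000}}
attrs = lambda d: set(next(iter(d.values())))  # attribute names of a record set
assert alice_v.keys() == bob_v.keys()          # same set of subjects
assert attrs(alice_v).isdisjoint(attrs(bob_v)) # different attributes

# Horizontally partitioned: same attribute schema, disjoint subjects.
alice_h = {"id1": {"weight": 70, "age": 30},
           "id2": {"weight": 62, "age": 25}}
bob_h   = {"id3": {"weight": 80, "age": 41},
           "id4": {"weight": 55, "age": 19}}
assert set(alice_h).isdisjoint(bob_h)          # A and B share no subjects
assert attrs(alice_h) == attrs(bob_h)          # identical attributes
```

The assertions state exactly the two defining properties: equal subject sets with disjoint attributes (vertical), and disjoint subject sets with equal attributes (horizontal).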
Hybrid Model - Vertically Partitioned, Partial Overlapping
In real life scenarios a joint dataset may not be purely a vertically partitioned data or a
horizontally partitioned data. In this case, the data partitioning model is called a
hybrid data model. There are different forms of hybrid data partitioning in a
distributed data computational setting. In this section, the most common form of
hybrid data partitioning is described, namely the vertically partitioned, partial
overlapping (V2PO) data model [REIT’04]. In the V2PO data model, datasets
managed by different parties may each have different attributes and some of them
may have overlapping items for some of the attributes. Figure 10 gives an example of
the V2PO data model in a two-party computational setting.
Figure 10. An example of the V2PO data model. (Source: Author’s own)
Other data partitioning models can easily be derived from these three data models, i.e. the vertically partitioned, horizontally partitioned and V2PO data models.
2.2.3 Adversarial Behaviours
Apart from the data partitioning models, the behaviour of a participating party also
plays an important part in the design of secure protocols. Specific adversarial
behaviours need particular considerations for privacy preservation. Two adversarial
behaviours are mainly discussed in the literature, namely the semi-honest model and
the malicious model.
Semi-honest (honest-but-curious) Model
A semi-honest party, sometimes also referred to as the honest-but-curious party,
would follow the protocol specification correctly, but with the exception that it would
keep a record of the data inputs and intermediate computation results in an attempt to
later derive additional information about the other party’s data. In other words, the
semi-honest party would strictly adhere to the protocol specification; however, it may
use whatever it has cached during the computation to compromise the privacy of other
participants’ data. A formal definition and detailed proof of privacy-preserving two-
party computation using the semi-honest model can be found in [GOLD’02,
GOLD’04] (definition 7.2.1 on page 620 and definition 7.2.2 on page 623). The
author also demonstrates that this definition can be extended to three or more party
computational scenarios [GOLD’04].
To employ the semi-honest adversarial model in a distributed computation, there are
two essential issues which need to be addressed regarding data privacy preservation.
The first issue is how to prevent a party from learning other parties’ inputs. Thus the
first essential facility that should be provided is for each party to disguise its private
data input before sending it for computation. The second issue is how to preserve the
privacy of intermediate computational results. In many cases, a computational task, or an execution of an algorithm, is performed as multiple sub-tasks, each generating some intermediate results. These intermediate results should also be protected, as they may contain valuable information about the other party’s data inputs. This information may lead to a breach of both data provider and data owner privacy. We remark that although the semi-honest adversarial model is much weaker than the malicious adversarial model (where a party may deviate from the protocol specification arbitrarily), it is often more realistic.
Malicious Model
In the malicious model, no trust assumptions are made about any of the participants.
Any of them can be malicious: they may not follow the protocol correctly; they may
substitute other parties’ input data; they may cache the received inputs for further
inference; a subset of them may collude with each other to infer other participants’
data and/or to execute an alternative algorithm. Although the patterns of behaviour are
much more complicated than those seen in the semi-honest model, the approaches
used here for privacy preservation are similar to those used in that model. It is shown
by Goldreich [GOLD’04] that any privacy-preservation protocol that is secure in the
semi-honest model can be transformed into a privacy-preservation protocol that is
secure in the malicious model. A formal definition and discussion of secure
two/multi-party computation in the malicious model can be found in [GOLD’04]
(definition 7.2.4 on page 628, definition 7.2.5 on page 629 and definition 7.2.6 on
page 630).
2.2.4 Data Privacy Definitions Used in Related Works
Because specific research problems raise particular privacy concerns, the definitions of data privacy in the literature are wide and varied. For example, in the context of PPDM [CLIF’02b], data privacy is defined as a two-level property: individual privacy and corporate privacy. Individual privacy refers to the privacy of “personal data”.
Any data that can be linked to a specific individual lies within the domain of personal
data, e.g. salary, capital and medical records of an individual [EURO’95, OCR’03].
On the other hand, the corporate privacy refers to the confidentiality of aggregated
data related to a dataset possessed by a data holder. Examples of aggregated data
include the mean value, variance and the trend of a dataset.
Domingo et al. have proposed a data privacy definition from the database point of
view [DOMI’07, DOMI’09a]. In this definition, data privacy is defined in three
aspects, i.e. the data owner privacy, the data provider privacy and the query client
privacy. The details of this definition are given below [DOMI’07, DOMI’09a].
(1) Data owner privacy (i.e. respondent privacy in the original paper) refers to
preventing re-identification of a data subject (e.g. individual or organization) to
which the records of a database correspond. Usually data owner privacy becomes
an issue only when the database is to be made available by the data provider
(hospital or national statistical office) to third parties, such as research
organizations or government.
(2) Data provider privacy (i.e. owner privacy in the original paper) refers to two or
more participating parties being able to compute queries across their database in
such a way that only the results of the query are revealed.
(3) Query client privacy (i.e. user privacy in the original paper) refers to guaranteeing
the privacy of queries to interactive databases in order to prevent query client
profiling and re-identification.
According to the authors of the papers, the issue of data owner privacy is mainly
addressed by the statistical community in the domain of statistical disclosure control
(SDC); this is also known as statistical disclosure limitation (SDL) or inference
control [WILL’96, WILL’01, HUND’10]. Data provider privacy is mainly addressed
by database, data mining and cryptography communities in the domains of PPDM
[AGRA’00] and SMC [LIND’00, LIND’02a, LIND’02b]. Query client privacy is
mainly addressed in the domain of private information retrieval (PIR) [CHOR’95].
Some other privacy definitions for specific PPDM problems have also been presented
in the literature. For example, in [PARA’06], the author defines privacy as freedom from unauthorized intrusion into sensitive data. It depends on the type of sensitive data in a dataset that needs to be protected and on the applications that use this
dataset. In [BERT’05], the authors consider privacy as the right of a participating
party to be protected from unauthorized disclosure of sensitive data. Such sensitive
data may be stored in a data provider or can be derived as aggregate information from
a dataset stored in the data provider.
According to the above guidelines and discussions, it can be summarized that privacy
definitions vary from one problem context to another and as such are dependent on
the type of data that needs to be protected and the nature of the application that
processes or uses the data. The privacy definition to be used in this thesis will be
discussed and presented in section 3.3.
2.3 Privacy-preserving Distributed Data Computation:
State-of-the-art
This section offers a critical review of the related works in the field of PPDDC. The
review will be delivered in three sections, each covering a line of research in the topic
area: SMC, PPDM and PPDSC. In general, the SMC techniques preserve data privacy
by employing cryptographic primitives. The PPDM solutions focus on transforming
normal data mining algorithms into privacy-preserving ones using either
cryptographic primitives or data perturbation techniques or both. Similarly, PPDSC
solutions focus on supporting statistical computation on a joint dataset while
preserving privacy of the joint dataset.
2.3.1 Secure Multi-party Computation (SMC)
Generally speaking, an SMC task deals with computing any probabilistic function on
a joint input contributed by multiple participating parties connected by a network. In
this network, each party holds part of the input. The computation ensures that no information other than a party’s own input and output is revealed to that party during the computation. Research activities in the field of SMC have been extensive since it
was first proposed by Yao in [YAO’82]. Goldreich has observed [GOLD’98] that the
general SMC problem is solvable in theory. However, the solutions derived by the
general results for an individual case of a multiparty computation problem may not be
practical. In other words, for the sake of efficiency, particular solutions should be
developed for specific cases of SMC problems. Goldwasser also suggests that the
SMC will be a research direction that is as important as the cryptography [GOLD’95].
Motivated by these pioneering works, extensive research activities have been carried
out [DU’01c, LI’06, BEND’08]. Works in relation to the SMC problem can largely be
divided into three categories: solutions to Yao’s millionaires’ problem (YMP), private
information retrieval (PIR) and selective private function evaluation (SPFE). This
section provides an intensive review of all these three research lines.
2.3.1.1 Solutions to Yao’s Millionaires’ Problem
The SMC problem was first proposed by Yao [YAO’82] and later elaborated by many
other research works [GOLD’97, GOLD’04]. It is referred to as the YMP in the
literature. It focuses on two parties comparing two numbers jointly and securely (i.e. in a privacy-preserving manner). The problem considers two parties, Alice and Bob, who hold two numbers x and y, respectively, where x and y are both integer-valued numbers within a bounded range. The two parties want to compute whether x ≥ y, without disclosing x to Bob or y to Alice.
Yao’s approach to this problem relies on the use of a public one-way function and a
random prime [YAO’82]. In Yao’s initial solution, the following assumptions were
used: Let M be the set of all N-bit nonnegative integers, and Q_N be the set of all one-to-one onto functions from M to M. In the case that Alice has i million pounds and Bob has j million pounds, where 1 ≤ i, j ≤ 10, the computation is performed as follows. Firstly, Bob picks a random N-bit integer x and computes k = E_a(x), where E_a is the encryption function of Alice, generated by choosing a random element from Q_N. Then, Bob sends k − j + 1 to Alice. Once Alice has received k − j + 1 from Bob, she computes y_u = D_a(k − j + u) for u = 1, 2, ..., 10, where D_a is the inverse function of E_a. Then Alice generates an appropriate random prime p of N/2 bits and calculates z_u = y_u (mod p) for u = 1, 2, ..., 10, where all the z_u differ from each other by at least 2. Alice then sends p and {z_1, ..., z_i, z_{i+1} + 1, z_{i+2} + 1, ..., z_{10} + 1} to Bob. Bob then looks at the j-th number in this sequence and decides that i ≥ j if it is equal to x (mod p); otherwise, i < j. Finally, Bob sends the result to Alice. This ensures that Bob learns nothing about i but is still able to determine the output.
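The message structure of this protocol can be sketched in Python. Here E_a is modelled by XOR with a key Alice keeps private, which is only a stand-in for a random element of Q_N (XOR is not a one-way function), the random prime p is replaced by a fixed one, and the “differ by at least 2” condition on the z_u is not enforced; the sketch illustrates the message flow, not a secure implementation.

```python
import random

def millionaires(i, j, N=32, seed=0):
    """Toy sketch of Yao's comparison for 1 <= i, j <= 10.
    Returns True iff i >= j."""
    rng = random.Random(seed)
    key = rng.getrandbits(N)
    E_a = lambda m: m ^ key        # stand-in for Alice's public function E_a
    D_a = lambda c: c ^ key        # Alice's private inverse D_a

    # Bob: pick a random N-bit x, compute k = E_a(x), send k - j + 1 to Alice.
    x = rng.getrandbits(N)
    msg = E_a(x) - j + 1

    # Alice: y_u = D_a(k - j + u) for u = 1..10, reduced mod a prime p.
    p = 999983                     # fixed prime standing in for Alice's random prime
    z = [D_a(msg + u - 1) % p for u in range(1, 11)]
    # Send z_1..z_i unchanged and z_{i+1}..z_{10} incremented by 1.
    sent = z[:i] + [v + 1 for v in z[i:]]

    # Bob: i >= j exactly when the j-th value equals x mod p.
    return sent[j - 1] == x % p

print(millionaires(7, 3), millionaires(2, 9))  # prints: True False
```

The key step is that y_j = D_a(k) = x, so the j-th value Bob inspects equals x mod p exactly when j ≤ i, i.e. when Alice left it unincremented.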
The millionaire problem was later generalized into a multiparty computation problem
in [BENO’88, CHAU’88 and GOLD’87]. All of these proposed solutions use a
similar methodology [LIND’02b], i.e. the function f to be computed is firstly
represented as a combinatorial circuit, and then the parties run a short protocol for
every gate in the circuit. With this approach, the computation is divided into three
stages: the input stage, the computation stage and the final stage. In the input stage,
each party garbles its data inputs and enters these garbled data inputs to the
computation. In the computation stage, all parties simulate f , gate by gate. The
intermediate computation result of each computed gate is kept secret from all parties.
In the final stage, the final computation result of f is computed and then sent to all
the parties [BENO’88]. Although this approach is attractive in its simplicity and
generality, the protocols it generates are directly related to the size of the circuit. As
this size depends on the data input and on the complexity of the function f , it may
not be practical for real life applications [IOAN’03]. For example, the data input
might be huge in a data mining application and a multiplication circuit is quadratic in
the size of its input. The computational complexity of this approach is O(2^n), which is exponential in the number of bits of the numbers involved (i.e. n) [ATAL’01]. However, the communication complexity of this approach is only three messages, which is relatively efficient.
To overcome the computational complexity problem, Cachin [CACH’99] proposed a
solution based on homomorphic public-key encryption, the Φ-hiding assumption and the use of a semi-trusted third party (STTP). It assumes that the STTP may misbehave on its own for the purpose of obtaining unauthorized information, such as the private inputs of the other participating parties, but that it does not collude with any of the participants. Cachin’s approach reduces the computational complexity from O(2^n) to O(n), where n is the number of bits of each input number.
Another relatively efficient secure two-number comparison protocol was proposed by Ioannidis and Grama [IOAN’03]. The protocol uses a 1-out-of-n oblivious transfer cryptographic primitive and the following observation: the comparison of two numbers is decided by the most significant bit in which they differ (consequently, identical bits do not affect the comparison result, and the effect of low-order bits is overshadowed by that of higher-order bits) [LI’08a]. The computational and communication complexities of this protocol are O(n) and O(n²), respectively, where n is the number of bits of each input number.
All of the aforementioned solutions are based on public key cryptography, thus
having relatively high computational costs. More recently, Li et al. [LI’05, LI’06,
LI’08a] proposed a series of solutions using symmetric key cryptography and set-inclusion theory. Using these solutions to compare two numbers x and y with a bounded range U, i.e. 0 ≤ x, y ≤ U, requires at most 2U bitwise exclusive-OR operations, which is much cheaper and faster than those using public key
cryptography. In addition, this solution can be used to compare both integer-valued
numbers (natural numbers) and non-integer-valued numbers (real numbers), which
makes it more attractive in a broader application context.
Table 1 summarizes the major features of the existing two-number comparison
protocols discussed above:
Table 1. A comparison of solutions to the YMP. (Source: Author’s own)
2.3.1.2 Private Information Retrieval (PIR)
A private information retrieval (PIR) scheme is a query-answer interaction between
two parties: a user and a database. It allows a user to retrieve information from a
database while keeping the privacy of the queries from the database. In this scheme,
the data is considered as an n-bit string x. The user wishes to obtain the bit x_i while keeping the index i private from the database. This research topic was first proposed
by Chor et al. in [CHOR’95] and later extended by many other researchers [KUSH’97,
CHOR’98, CACH’99, OSTR’07, DOMI’08, DOMI’09a, DOMI’09b].
In [CHOR’95], the authors define the PIR problem as follows. There exists a binary string database x = x_1 ⋯ x_n of length n. Identical copies of this database are stored by k ≥ 2 servers. The user has an index i and is interested in obtaining the value of the bit x_i. To obtain x_i, the user sends a query to each of these k servers and gets replies. The query to each server is distributed independently of i; therefore, these servers gain no information about i. In this setting, as the database is viewed as public information and may be stored in more than one server, this PIR problem does not include database privacy or server privacy. Rather, it only focuses on preserving user privacy. Three solutions to the PIR problem were proposed in this paper. These
solutions make use of database replication and the low-degree polynomial
interpolation method [BEAV’90b]. The first scheme provides a solution for two databases with a communication complexity of O(n^{1/3}). The second scheme provides a solution for k databases with a communication complexity of O(n^{1/k}), where k is a constant number and k ≥ 3. The third scheme, a more elaborate form of the second scheme, requires O(log²n · log log n) bits of communication when k = (1/3)·log₂n. By enabling a query user to access k replicated copies of a database (k ≥ 2) and privately retrieve information stored in the database, the communication complexities of these schemes are all significantly less than asking for a copy of x from the server. (The communication complexity of asking for a copy of x from the server is O(n).) It is also shown in this paper that a single-database PIR scheme does not exist in the information-theoretic sense.
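Although Chor et al.’s sub-linear schemes are more involved, the information-theoretic idea behind multi-server PIR is visible in the classic two-server XOR warm-up scheme (which itself still uses linear communication): the user asks server 1 for the XOR of the bits in a uniformly random subset S of positions, and server 2 for the XOR over S with membership of position i flipped. Each server sees only a uniformly random subset, yet the two answers XOR to x_i. A sketch:

```python
import random

def server_answer(db, query):
    """Each server XORs the database bits at the queried positions."""
    acc = 0
    for t in query:
        acc ^= db[t]
    return acc

def two_server_pir(db, i, seed=0):
    """Retrieve db[i] without either (non-colluding) server learning i."""
    rng = random.Random(seed)
    S = {t for t in range(len(db)) if rng.random() < 0.5}
    S2 = S ^ {i}                         # flip membership of position i only
    return server_answer(db, S) ^ server_answer(db, S2)

db = [1, 0, 1, 1, 0, 0, 1, 0]
print([two_server_pir(db, i) for i in range(len(db))])  # recovers db itself
```

All positions other than i appear in both queries or in neither, so their contributions cancel under XOR, leaving exactly db[i].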
In 1997, however, Kushilevitz et al. [KUSH’97] demonstrated that a single-database
PIR scheme was achievable by employing a secure public-key encryption scheme.
Based on the quadratic residuosity assumption [GOLD’82], the communication complexity of this single-database computationally-private information retrieval scheme is only O(n^ε) for any ε > 0. The same authors later improved this scheme by using a trapdoor permutation and a 1-out-of-n oblivious transfer protocol
[KUSH’00]. Two PIR protocols were designed: one assumes an honest-but-curious server (PIRHbC) and the other a malicious server (PIRM). In the honest-but-curious server protocol, given a data string x, where the length of x is denoted by n and the security parameter used in the trapdoor permutation is denoted by K, the total size of the messages sent by the user of the PIRHbC protocol is O(K) bits. The total size of the messages sent by the server of the PIRHbC protocol is n − n/(2K) bits. The communication complexity of PIRHbC is thus bounded by n − n/(2K) + O(K) bits. An improvement on the PIRHbC protocol is also described in the paper; it further reduces the communication at the cost of the user picking d functions, where d and a constant c parameterize the bound. The communication complexity of the PIRM protocol is bounded by n − n/(6K) + O(K²) bits, where O(K²) bits of message are sent by the user and the server sends no more than n − n/(6K) bits of message during the protocol execution. A
good survey of single-database private information retrieval schemes can be found in [OSTR’07].
However, as indicated in [DOMI’08, DOMI’09a, DOMI’09b], the PIR solutions
proposed so far have two fundamental shortcomings:
(1) Consider the case where a database contains n data items and a user wants to retrieve the i-th item in the database. The PIR protocols attempt to maximize server uncertainty about the index i. Chor et al. have proved that the computational complexity of such PIR protocols is O(n) [CHOR’95, CHOR’98]. In such cases, all data items in the database must be accessed by the protocols; otherwise, the server could easily exclude the untouched records when trying to discover i. For large datasets, an O(n) computational cost may not be affordable.
(2) The PIR protocols assume that the database server cooperates with the user during the protocol execution and that the user wants to keep his or her own privacy from the server. However, in many real life applications the motivation of database servers might not be limited to breaching users’ privacy. This has restricted the application of PIR.
A specific application of PIR on a search engine or database query was proposed in
[DOMI’08, DOMI’09a, DOMI’09b]. Three types of adversaries are considered in the
design of the solutions. They are users, databases or search engines, and external
intruders. In this scenario, a user wants to submit a query to the database but would
like to keep the detail of his or her query secret from the database and external
intruders. The aim of the design is to ensure that the query processes are carried out,
while, at the same time, preventing the database and external intruder from knowing
the detail of the query. Two approaches were proposed to address this problem: a
peer-to-peer approach [DOMI’08, DOMI’09a] and a pragmatic approach
[DOMI’09b]. In the peer-to-peer approach, a combinatorial peer structure, i.e. the
(v, b, r, k)-configuration [STIN’03], was introduced to reduce the required key material and to increase the point availability in a peer-to-peer user community. A peer-to-peer UPIR solution was proposed in [DOMI’09a]. This solution employs a dealer who creates v keys and distributes them into b blocks of size k, according to the (v, b, r, k)-configuration structure. In the case that r ≥ 2, the number of keys and the memory sectors required, and the overall number of keys stored by the users, are less than in solutions using a complete graph. In this approach, any query can be submitted, which will neither consider keyword frequencies nor swamp search engines with ghost queries. The level of privacy achieved is proportional to the connectivity k(r − 1) of the peer-to-peer community. It is also shown that k(r − 1) = b − 1 achieves the optimal solution when using the (v, b, r, k)-configuration structure.
In the pragmatic approach, on the other hand, Domingo-Ferrer et. al. proposed a
solution in [DOMI’09b] to protect user query privacy in databases and search engines.
Two protocols (i.e. the Naïve(q_0, k) protocol and the Enhanced(q_0, upwd, k) protocol) for keyword search have been designed to protect the privacy of user queries; they do not assume any cooperation from the database. Given a query q_0 and a non-negative integer k, the Naïve(q_0, k) protocol protects q_0’s privacy by randomly permuting q_0 with k − 1 bogus queries. The Enhanced(q_0, upwd, k) protocol protects q_0’s privacy by employing a pseudo-random number generator and a user password “upwd”.
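A minimal sketch of the Naïve(q_0, k) idea follows; the bogus-query pool and all names here are illustrative, and the actual protocol selects and manages bogus keywords more carefully.

```python
import random

def naive_query(q0, bogus_pool, k, seed=0):
    """Hide the real query q0 among k-1 bogus queries, in random order,
    before submitting the whole batch to the database or search engine."""
    rng = random.Random(seed)
    batch = rng.sample(bogus_pool, k - 1) + [q0]
    rng.shuffle(batch)
    return batch

batch = naive_query("heart disease", ["weather", "football", "recipes", "travel"], 3)
print(batch)
```

The server observes k queries of equal standing, so without further side information it cannot tell which of them is the user’s real query q_0.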
2.3.1.3 Selective Private Function Evaluation (SPFE)
The selective private function evaluation problem was firstly defined in [CANE’01].
It concerns the computation between clients and servers. In this problem, a client
interacts with one or more servers, who hold copies of a database x_1, ..., x_n, in order to compute f(x_{i_1}, ..., x_{i_m}) for some function f and indices i_1, ..., i_m chosen by the client. Ideally, the client should learn nothing about the database other than f(x_{i_1}, ..., x_{i_m}), and the server should learn nothing at all. As the generic solutions are
developed based on the standard techniques for secure function evaluation, these
solutions incur communication complexity at least linear in n . This makes the generic
solutions impractical when dealing with a large database, even when computing a
simple f with small m . The authors of [CANE’01] propose several approaches to
construct sub-linear communication SPFE protocols, both for generic solutions and
special cases of interest. In these approaches, the authors concentrate on the case
where the server learns f and m but not the m locations in the database to which f
is applied. The solutions not only preserve the secrecy regarding the client’s queries
but also prevent the server from revealing a large amount of information to the client.
58
All single-server SPFE protocols proposed in this paper have a communication
complexity at least equal to the size of the Boolean circuit needed to
implement f. In practice, f is often complex, so f’s circuit size is at least
linear in m.
An augmented form of the SPFE problem was defined in [LIMP’09], where the
authors allow the client to have an extra private input y. The client seeks the
computation result f(x_{i_1}, …, x_{i_m}, y). This enables SPFE to be applicable in
more applications, for example, the private similarity test application mentioned in
[LIMP’09]. Two generalized protocols are proposed in this paper. The first protocol
only works in the case of constant m. It is particularly efficient when m = 1. By
employing the private binary decision diagram (PrivateBDD) protocol [ISHA’07] in
this protocol, the server computes a database of the answers f(x_{i_1}, …, x_{i_m}, y) for all
index tuples (i_1, …, i_m). The client and server then run a two-message 1-out-of-C(n, m)
computationally-private information retrieval ((C(n, m), 1)-CPIR) protocol [LIPM’05] on
the database so that the client receives f(x_{i_1}, …, x_{i_m}, y). This protocol requires
computing f on C(n, m) different inputs. The authors show that this protocol is efficient in
many cases as the evaluation of the binary decision diagram (BDD) corresponding to
f is efficient for known values of x_i. The second protocol works for any value of m.
The client and the server execute an input selection protocol. The client has input
(i_1, …, i_m). The server has (x_1, …, x_n). The server first generates random strings
(r_1, …, r_m), computes x_{i_1} ⊕ r_1, …, x_{i_m} ⊕ r_m and then sends these masked values to the client.
It is noted that the client and the server thereby share the inputs (x_{i_1}, …, x_{i_m}). The client and the
server then execute a PrivateBDD protocol with the following inputs: the client inputs
(x_{i_1} ⊕ r_1, …, x_{i_m} ⊕ r_m, y); the server inputs (f, r_1, …, r_m). The client then retrieves the
output f(x_{i_1}, …, x_{i_m}, y). This generalized SPFE protocol has 4 message exchanges. It
requires computing an (n, 1)-CPIR m times and then executes PrivateBDD for
computing f. This protocol has sub-linear communication whenever f has a BDD
with length that is sub-linear in n.
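The masking step of the input selection protocol rests on XOR secret sharing: the client ends up holding a masked value and the server keeps the mask, so neither share alone reveals the input, yet together they determine it. A minimal Python sketch of this idea follows (illustrative names, a single value rather than the m CPIR-retrieved items):

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def share(x: bytes):
    """Server-side masking: pad the selected item x with a fresh random
    string r. The client receives x XOR r, the server keeps r; neither
    share alone reveals x, but XORing both shares recovers it."""
    r = secrets.token_bytes(len(x))
    return xor_bytes(x, r), r

x = b"database item"
client_share, server_share = share(x)
```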
2.3.2 Privacy-preserving Data Mining (PPDM)
Data mining is another research area that is aimed at discovering previously unknown
knowledge in a database. It is a process that tries to extract common patterns from a
large database using methods from statistics, artificial intelligence and database
management. Modern computer systems are collecting data at an unimaginable rate
and from vast sources, e.g. the Internet; a huge amount of confidential information is
stored and managed by data providers. Research has also shown that mining
knowledge in a database without any control could compromise the privacy of
individuals as well as the confidentiality of organizations [VERY’04, VAID’06]. This
has increased concerns that data mining algorithms should be designed with privacy-
preserving capabilities, i.e. the protection of the aggregated information on a dataset.
PPDM is a new research direction in data mining, where data mining algorithms are
analysed for any possible violation in data privacy. Its main objective is to design
algorithms that modify or alter the original data using specific techniques so that
private data and confidential information will not be revealed after the mining process.
For this purpose, the focus of PPDM is twofold. Firstly, to prevent
the compromise of data subjects’ privacy, sensitive raw data, e.g. identifiers, identities
and addresses, should be altered or removed from the original database before being
sent to the computation. Secondly, sensitive information that can be mined from data
mining algorithms should also be hidden or disguised, as this information may be
used to compromise data privacy. The ultimate goal of PPDM is to preserve data
privacy of the original database while allowing the mining process to be carried out.
This goal can be achieved by modifying normal data mining algorithms or designing
new algorithms to equip them with privacy-preserving properties.
This subsection provides a review of the PPDM protocols and algorithms. It first
provides a classification of research issues studied in designing PPDM solutions. This
is then followed by a review of a variety of the techniques that have been developed
and applied in designing PPDM algorithms. Finally, this subsection is concluded with
a summary of the research results in this area.
2.3.2.1 Research Issues
A large number of PPDM research results and proposals have been published in the
literature [LIND’00, AGRA’01, BERT’05, VAID’06, WU’07, AGGA’08]. According
to [VERY’04], they can be classified into five focuses: (1) data distribution, (2) the
information to be hidden, (3) approaches for privacy preservation, (4) data mining
algorithms and (5) privacy-preserving techniques.
Data Distribution
The first focus is on the properties of a data model. One method will not fit all cases;
a range of data models requires corresponding data privacy protection methods. Both
a centralised data model and a distributed data model are discussed in the literature.
The distributed data model can be further classified into a horizontally partitioned
data model and a vertically partitioned data model. Earlier research activities were
mainly focused on privacy preservation in relation to a centralized data model. More
recent research activities have been focusing on distributed data models as most
databases are stored and managed on different sites in today’s global digital network.
The Information to Be Hidden
The second focus is on which part of the raw data or aggregated data should be hidden
in the mining process. There are two approaches to the hiding problem: data hiding
and rule hiding. Data hiding refers to hiding the sensitive information in a dataset.
Examples of sensitive information are identity, name and address. Rule hiding, on
the other hand, attempts to remove sensitive knowledge that can be derived from the
original dataset by data mining algorithms. It is worth noting that a majority of PPDM
algorithms use the data hiding approach, particularly in a distributed database
environment. Rule hiding is mainly used for association rule mining in a
centralized database.
Approaches for Privacy Preservation
The third focus is on the approaches for privacy preservation. The purpose is to
provide high quality disguised data while preventing data privacy from being
compromised. Three approaches are used in the literature, i.e. (1) heuristic-based
approach, (2) cryptography-based approach and (3) reconstruction-based approach.
The heuristic-based approach modifies only the selected values in the original dataset,
which helps to minimize the utility loss of a disguised dataset. This approach is
mainly used for centralised databases. The cryptography-based approach preserves
privacy by means of encryption schemes and is mainly used for distributed databases.
The reconstruction-based approach reconstructs the distribution of the original dataset
from a randomised dataset. With this approach, the sensitive raw data are hidden by
employing perturbation techniques based on probability distributions.
Data Mining Algorithms
The fourth focus is on the design of data mining algorithms. Most existing PPDM
algorithms are designed for performing data classification, association rule mining
and data clustering. Data classification is the process of finding a set of properties that
describe and distinguish data classes. These properties can be used to classify or
predict the class of an unclassified data object. Association rule mining is a process to
discover patterns and rules in a dataset. Data clustering concerns the problem of
decomposing or partitioning a dataset into groups such that the records within a
group are more similar to one another than to records in other groups.
Privacy-preserving Techniques
The fifth focus is on the types of privacy-preserving techniques (also known as data
disguising techniques in some literature) that should be used in designing a PPDM
algorithm. A privacy-preserving technique is used to modify an original dataset before
its release to the computation, in order to protect data privacy. It is essential that the
privacy-preserving techniques should be in line with privacy requirements adopted by
the data providers. Privacy-preserving techniques can be largely divided into two
categories: data perturbation techniques and cryptographic primitives. Data
perturbation techniques include data alteration, data blocking, data aggregation, data
suppression, data swapping and data sampling. Cryptographic primitives involve all
cryptographic schemes.
2.3.2.2 Review of PPDM Solutions
A number of privacy-preserving solutions have been developed to address a
corresponding number of data mining problems, for example, clustering, association
rule mining and classification problems. This section reviews these PPDM solutions
based on the three major approaches, namely, the heuristic-based, the reconstruction-
based and the cryptography-based approaches.
Heuristic-based Approach
Oliveira et al. proposed a heuristic-based approach to address the privacy-
preserving frequent itemset mining problem in [OLIV’02]. The focus of this work is
to hide the set of frequent patterns which contain highly sensitive information. A set
of sanitizing algorithms is proposed in this paper, which are used to remove certain
pieces of information from a transactional database. An item-restriction method is
used in designing these algorithms. By doing so, the addition of noise data to real data
can be avoided and the removal of real data can be limited. Three quantitative metrics
are introduced to evaluate the proposed algorithms. These three metrics are (1) Hiding
Failure, (2) Misses Cost and (3) Artifactual Pattern. The hiding failure is measured as
the percentage of restrictive patterns that are discovered in the sanitized database. The
misses cost is measured as the percentage of non-restrictive patterns that are hidden
after the sanitization process. The artifactual pattern is measured as the percentage of
discovered patterns that are artefacts. The efficiency of the algorithms is measured in
terms of CPU time. More specifically, three different methods are proposed to
measure the dissimilarity between the original and sanitized database. The first
method is based on the difference between the frequency histograms of the original
and the sanitized databases. The second method is based on computing the difference
between the sizes of the sanitized and the original databases, while the third method is
based on a comparison between the contents of the two databases.
A heuristic-based solution to protect the privacy of raw data through the use of
generalization and suppression techniques is proposed in [SWEE’02b]. The solution
achieves k-Anonymity [SWEE’02a]. A release of a database is said to adhere to
k-Anonymity if each record in the release cannot be distinguished from at least
k − 1 other records in the release. It is then called a k-anonymous database. By
performing generalization operations on the values of some target attributes, a
database A can be converted into a new database A′ that guarantees the
k-Anonymity property for the sensitive data inputs. As a result, in a computation
process seeking k-Anonymity, such attributes are easily affected by data
distortion, owing to the different levels of generalization the computation has applied.
The concept of “precision” is also introduced in this paper. Given a table T, the
precision represents the information loss incurred by converting the table T to a
k-anonymous table T_k. The precision of the table T_k is measured as one minus the sum
of all cell distortions, normalized by the total number of cells. It is worth noting that
in the case where generalization techniques are adopted by a PPDM algorithm for
hiding sensitive information, the precision can also be viewed as a measure of the data
quality or the data utility of the released table.
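The k-Anonymity condition itself can be checked mechanically. The following Python sketch (attribute names and example records are illustrative, not drawn from [SWEE’02a]) tests whether every combination of quasi-identifier values occurs in at least k records:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """A release is k-anonymous with respect to the quasi-identifier
    columns if every combination of quasi-identifier values appearing
    in the release occurs in at least k records."""
    counts = Counter(tuple(row[c] for c in quasi_ids) for row in rows)
    return all(c >= k for c in counts.values())

# Illustrative generalized release: ages in bands, zip codes truncated.
rows = [
    {"age": "20-29", "zip": "M1*", "disease": "flu"},
    {"age": "20-29", "zip": "M1*", "disease": "cold"},
    {"age": "30-39", "zip": "M2*", "disease": "flu"},
    {"age": "30-39", "zip": "M2*", "disease": "asthma"},
]
# Each (age, zip) combination appears twice, so the release is
# 2-anonymous but not 3-anonymous.
```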
Reconstruction-based Approach
A reconstruction-based technique is proposed in [AGRA’00]. This technique is used
to estimate the probability distribution of the original numeric data values in order to
build a decision-tree classifier from a disguised dataset. A quantitative measure is
proposed to evaluate the level of privacy offered by a method, and the proposed
method is evaluated against this measure. The privacy provided by a reconstruction-
based technique is measured by evaluating how closely the original value of a
modified attribute can be determined. More specifically, the interval within which a value x
lies with c% confidence can be measured, and the width of this interval can be
viewed as the level of privacy at a c% confidence level. The accuracy of the
proposed algorithms is also assessed for Uniform and Gaussian perturbation under a
fixed privacy level. The approach proposed in [AGRA’01] is based on an expectation
maximization (EM) algorithm for distribution reconstruction. The reconstructed
distribution converges to the EM estimate of the original distribution on the
perturbed data. The metrics proposed in this paper provide a quantification and
measurement of privacy and information loss. The average conditional privacy of an
attribute A, given side information modelled with a random variable B, is defined as 2^{h(A|B)}, where h(A|B)
is the conditional differential entropy of A given B. The information loss measures
the lack of precision in estimating the original distribution from the perturbed data.
It is defined as half the expected value of the L1-norm between the
original distribution and the reconstructed distribution. Both proposed metrics are
universal, so they can be used to assess any reconstruction algorithm,
independent of the type of data mining task applied.
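The interval-width privacy measure of [AGRA’00] can be illustrated for uniform perturbation. In this Python sketch (the noise bound a and the data values are hypothetical), a disguised value pins the original down only to an interval of width 2a, which is the privacy offered at 100% confidence:

```python
import random

def perturb_uniform(values, a, rng=random):
    """Randomize each value by adding independent uniform noise from
    [-a, a]. Given a perturbed value w, the original is only known to
    lie in [w - a, w + a], so the interval width 2a is the privacy
    offered at 100% confidence; narrower intervals correspond to
    lower confidence levels."""
    return [v + rng.uniform(-a, a) for v in values]

original = [30.0, 45.0, 62.0]          # hypothetical sensitive values
disguised = perturb_uniform(original, a=10.0)
# Every original value lies within +/-10 of its disguised counterpart.
```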
A framework for mining association rules from transactions consisting of categorical
items is proposed in [EVFI’02]. This framework ensures that only true association
rules are mined when mining on randomized data. A formal definition of privacy
breach and a class of randomization operators are provided in this paper, which offer
a more efficient way of limiting privacy breaches than uniform randomization.
The privacy breach is estimated by counting the occurrences of an itemset in the
randomized transactions and of its sub-itemsets in the corresponding non-randomized
transactions. The item causing the worst privacy breach is then chosen from all sub-
items of an itemset. The worst and the average breach levels are computed over
all frequent itemsets for each combination of transaction size and itemset size. The
itemset size giving the worst value for each of these two measures can then be selected.
Cryptography-based Approach
A cryptography-based technique is proposed in [KANT’02]. This technique deals
with the problem of secure mining of association rules over horizontally partitioned
data. Cryptographic techniques are employed to minimize the
information shared. This approach assumes that each party first encrypts its own
itemsets using commutative encryption, and then encrypts the itemsets received from all the other parties. An
initiating party then transmits its frequency count, plus a random value, to its
neighbour. After receiving this message from the initiating party, the neighbouring
party also adds its frequency count to this message, which it then passes on to the
other parties. Finally, the initiating party and the final party collaboratively perform a
secure comparison to determine whether the final result is bigger than the threshold
plus the random value. The proposed methods are evaluated in terms of
communication and computation costs. The communication cost is measured using
the number of message exchanges among the parties. The computation cost is
measured using the number of encryption and decryption operations required by the
algorithm. Another cryptography-based approach is presented in [VAID’02], which
addresses the problem of association rule mining in vertically partitioned data. In
particular, it aims at determining the item frequency when transactions are distributed
across different sites, while preserving the contents of each transaction. A security and
communication analysis is also performed in this paper. The security of the
protocol for computing the scalar product is analyzed. The total communication cost
of this solution depends on the number of candidate itemsets.
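The masked ring summation used in [KANT’02] for combining frequency counts can be sketched in Python as follows (the modulus and counts are illustrative; this omits the final secure comparison against the support threshold):

```python
import random

def secure_ring_sum(local_counts, modulus=2**31, rng=random):
    """Ring-based secure summation of frequency counts: the initiating
    party adds a random mask R to its count and passes the running total
    to its neighbour; each party adds its own count and passes the total
    on; finally the initiator subtracts R. Intermediate totals reveal
    nothing to the parties because they are uniformly masked modulo the
    modulus."""
    R = rng.randrange(modulus)
    running = (local_counts[0] + R) % modulus    # initiating party masks its count
    for c in local_counts[1:]:                   # each neighbour adds its count
        running = (running + c) % modulus
    return (running - R) % modulus               # initiator removes the mask
```

A larger modulus than the maximum possible sum must be chosen so that the unmasked result is not reduced.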
2.3.2.3 Summary of PPDM Research Results
Figure 11 summarizes the existing efforts and approaches in the field of PPDM
algorithm designs based on the aforementioned discussions.
Figure 11. A taxonomy of the developed PPDM algorithms. (Source: Author’s own)
2.3.3 Privacy-preserving Distributed Statistical Computation
(PPDSC)
A PPDSC technique is aimed at supporting a joint statistical computation by multiple
parties without compromising the privacy of individual datasets managed by
respective parties. Existing PPDSC solutions can be largely divided into three
categories: statistical disclosure control (SDC), secure matrix product computation
(SMPC) and secure linear regression computation (SLRC). In general, the SDC
technique preserves data privacy by pre-processing the dataset before releasing it. The
SMPC technique serves as a building block for large dataset computation by
preserving data privacy through the use of matrix product techniques. The SLRC
technique transforms normal regression computation algorithms to privacy-preserving
ones in different research contexts. The details of these three techniques are given
below.
2.3.3.1 Statistical Disclosure Control
SDC is concerned with how to prevent the identification of an individual in a dataset
and the disclosure of confidential information about the dataset through the use of
statistical techniques [WILL’96, WILL’01]. A formal definition of statistical disclosure
can be found in [ELLI’05]: “The revealing of information about a population unit
through the statistical match of information already known to the revealing agent (or
data intruder) with other anonymised information (or target data set) to which the
intruder has access, either legitimately or otherwise.” As a released dataset can always
be collected and stored by the data collector and then be analysed to retrieve other
information from the dataset, the disclosure of confidential information about the
released dataset is a potential problem after data dissemination. The purpose of SDC
techniques is to reduce the risk of information disclosure in disseminated datasets.
The SDC methods focus on restricting the amount of data released, or on modifying the dataset, before
it is disseminated. They can be applied to two types of data: microdata and tabular
data. Microdata consists of a series of records, each containing information on an
individual unit such as a person, a company or an organisation. The simplest form of
microdata can be represented as a single data matrix, where the rows correspond to
data subjects and the columns to the variables. Microdata is the basis for tabular data.
Tabular data is the aggregated information on data subjects presented in tables.
Frequency and magnitude data are examples of tabular data. Tabular data are
normally directly processed for statistical confidentiality. According to our literature
survey, five techniques are used to assist disclosure control [WILL’96, WILL’01,
SHLO’07, FAYY’10, ELLI’05]: recoding, cell suppression, rounding,
masking/blurring and data swapping.
Recoding is a tool that is commonly used in disclosure control solutions [HURK’98,
ELLI’05, CAST’10]. The idea of recoding is to take the raw data and re-categorise a
variable. By using this technique the variable’s attributes with lower frequencies can
be merged. A typical example is age; a single-year age attribute can be recoded into age
groups of 5 or 10 years. Table 3 illustrates the result of attribute recoding of Table 2. It
is also common to use this technique for income or occupation data; for example,
specific occupations or higher income bands can be grouped together. A special
case of recoding is top-recoding, used where frequencies toward the top of the
variable’s range are likely to be low. Table 4 illustrates the result of top-recoding of
Table 2. A common application of this technique is global recoding, which involves
applying a recode universally across a file. On the other hand, localized recoding is
preferable when a variable is highly correlated with location. The advantage of
recoding is that its impact on data quality is visible. However, it changes the table
structure; as the loss of information affects the entire dataset, the benefit in terms of
risk reduction may be relatively small.
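Both variants of recoding can be sketched in a few lines of Python (the band width and top threshold below are illustrative, not taken from Tables 2-4):

```python
def recode_age(age, band=5):
    """Global recoding: replace an exact age with its band, e.g. 7 -> '5-9'."""
    lo = (age // band) * band
    return f"{lo}-{lo + band - 1}"

def top_recode(age, top=8):
    """Top-recoding: collapse all values at or above a threshold into a
    single open-ended category, e.g. 9 -> '8+'."""
    return f"{top}+" if age >= top else str(age)
```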
Table 2. An example table of frequency count for data subjects whose age is between 1 to 10 and who live in area A1 to A4. (Source: Author’s own)
Table 3. An example of attribute recoding. (Source: Author’s own)
Table 4. An example of top-recoding. (Source: Author’s own)
Cell suppression, a technique used to control disclosure in tabular or aggregated data
[SALA’06, LI’08b, CAST’10], simply means leaving a cell blank. This
operation is most effective when a cell is classified as sensitive. Normally this
happens when a cell has a low or zero count and is therefore more likely to cause cell value
disclosure. A weakness of cell suppression is that it is insufficient to suppress only the
sensitive cells; additional suppressions on non-sensitive cells are also needed. For example,
if only cells with a small range of values are suppressed, the suppressed cells would
be identifiable as having a small range of values. Therefore, it is also necessary
to suppress some other non-sensitive cells to disguise any patterns in the values of
sensitive cells, and the resulting table may contain fewer analytical data items. Given that
Table 5 is the original data, which has a sensitive data item in cell (C, B), Table 6
illustrates the cell suppression result of Table 5.
Table 5. A table with sensitive cell (C,B). (Source: Author’s own)
Table 6. A table with sensitive cells suppressed and further complementary suppression made. (Source: Author’s own)
Rounding is a technique used to disguise the exact frequency count of a cell by
rounding every cell in a table to a given number base [SALA’06, LI’08b, CAST’10].
The base is typically set to 3, 5 or 10. In other words, rounding simply means that the
value of the data item in a cell (i.e. cell value) is rounded to one of the two closest
integer multiples of the rounding base. This technique is normally applied to
aggregate data. Table 7 illustrates the result of Table 5 where the cell values have
been randomly rounded to base 5. In this table, cell counts are rounded either up or
down randomly. One problem with this technique is that, under certain circumstances,
the rounded cell counts cannot add up to the subtotals and the total of the table, which makes
rounding very difficult. A more relaxed form, controlled rounding, rounds the cell values in
such a way that they are not too far removed from the
original values while preserving additivity. Even when additivity is achieved, controlled rounding can still
cause certain problems. For example, if multiple overlapping tables are
rounded separately, the common marginal cells may be rounded to different values;
this further reduces the accuracy of the rounded tables.
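Random rounding of a cell count to a base can be sketched as follows (a hypothetical helper; real SDC packages additionally control the marginal totals):

```python
import random

def random_round(value, base=5, rng=random):
    """Round a non-negative cell count to one of the two nearest
    multiples of the base, choosing up or down at random; exact
    multiples are left unchanged."""
    lo = (value // base) * base
    if value == lo:
        return value
    return lo if rng.random() < 0.5 else lo + base

v = random_round(13, base=5)   # either 10 or 15
```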
Table 7. A table with cell values randomly rounded to a base of 5. (Source: Author’s own)
Masking/blurring techniques preserve data privacy by adding noise to the data to be
protected [ELLI’05, SHLO’07, FAYY’10]. For tabular data, this can be achieved by
adding/subtracting random integers to/from cell values. For microdata, this can be
done by simply changing a data value to another value. One advantage of these
techniques is that even if a large amount of data on a specific individual is gathered by
an intruder, the precise values for that individual remain concealed by the noise.
In general, the larger the variances of the noise are, the better the disclosure
protection provided.
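A minimal sketch of cell masking for tabular data follows (the noise bound is chosen arbitrarily for illustration):

```python
import random

def mask_cells(table, max_noise=2, rng=random):
    """Add or subtract a small random integer to every cell count,
    clamping at zero so that counts stay non-negative."""
    return [[max(0, cell + rng.randint(-max_noise, max_noise)) for cell in row]
            for row in table]

table = [[4, 0, 7], [2, 9, 1]]
noisy = mask_cells(table)
```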
Data swapping applies a sequence of elementary swaps to a dataset [DOMI’01,
ELLI’05, SHLO’07]. Each elementary swap consists of two tasks. The first
task is to randomly select two data records in the dataset; for tabular data, this
may be done by randomly selecting two rows or columns in the table. The
second task is to swap the selected data records; for tabular data,
this is done by swapping the selected rows or columns. If the swapping is
performed on potential key variables or sensitive information, better data
protection can be provided than by swapping normal data records. For example, the swapping of
key variables and sensitive data records would make the intruder’s data matching
task harder and the matching result inaccurate. It is usually preferable to swap records
that share the same values on other variables, as this leads to less data distortion and
reduces the risk of inconsistent value combinations.
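An elementary swap on a key variable can be sketched as follows (attribute names and values are illustrative):

```python
import random

def elementary_swap(records, key_var, rng=random):
    """One elementary swap: pick two records at random and exchange the
    values of a chosen key variable between them. The marginal
    distribution of the variable is preserved; only its assignment to
    records changes."""
    i, j = rng.sample(range(len(records)), 2)
    records[i][key_var], records[j][key_var] = (
        records[j][key_var], records[i][key_var])
    return records

records = [{"zip": "M1", "income": 100}, {"zip": "M2", "income": 200},
           {"zip": "M3", "income": 300}]
swapped = elementary_swap([dict(r) for r in records], "zip")
```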
2.3.3.2 Secure Matrix Product Computation
In a cooperative computational environment, the data provided by multiple data
holders may be horizontally partitioned, vertically partitioned or hybrid data. As
these data types are difficult to express clearly and concisely, and often require
multi-dimensional representations, computations involving any of
these three data types are much more complex than computing equations over
one-dimensional values. Secure matrix product protocols [DU’01b, KARR’09a]
integrate cryptographic primitives (e.g. oblivious transfer protocol and homomorphic
encryption method) with linear algebra theories to compute such types of data and to
compute their mean values with privacy preservation. There are two major types of
secure matrix product protocols: the secure scalar product protocol and the secure
shared scalar product protocol. A secure scalar product protocol allows two parties,
71
Alice with her private input }...,,{ 1 nxxX , and Bob with his private input
}...,,{ 1 nyyY , to jointly compute the scalar product
n
i ii yxYX1
, securely
without revealing X and Y to the other party. In other words, after the protocol
execution, Alice learns no more information other than x and YX , and Bob learns
no more information other than y and YX . A secure shared scalar product protocol
allows two parties, Alice and Bob, to compute equation YXss BA with privacy
preservation. After the protocol execution, Alice receives As and Bob receives Bs ,
such that YXss BA . Both of the above approaches have found important
applications in developing secure solutions to distributed multiparty computation.
A number of secure scalar product protocols and secure shared scalar protocols have
been proposed in the literature [LUO’05, DU’01a, DU’04, LUO’03, KARR’09a,
DU’02, DU’01e, GOET’04, SHEN’07, WANG’09]. Du et al. proposed an invertible-matrix
approach and a commodity-based approach [DU’04]. The former enables a trade-off
between efficiency and privacy while the latter, based on Beaver’s commodity model
[BEAV’97], achieves some performance gain by sacrificing a certain degree of
security. The secure two-party scalar-product protocol proposed by Goethals et al.
[GOET’04] relies on the intractability of the composite residuosity class problem to
achieve security. Luo et al. [LUO’03, LUO’04, LUO’05] extended Du’s secure two-
party scalar product protocol to three-party scenarios and developed a set of real
product protocols: a real product protocol, two add-product protocols, a division
protocol, an exponential function protocol, a power function protocol, a logarithmic
function protocol and a trigonometric function protocol. These protocols can be used
to resolve several specific problems, such as secure exponential function computation
problem, secure power function computation problem and secure logarithmic function
computation problem. All of these protocols are designed based on the secure scalar
product protocol, which sheds some light on finding secure solutions to support
distributed multiparty computation.
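The commodity model mentioned above can be illustrated with a shared scalar product sketch in the style of Du's commodity-based protocols. This is a simplified version over the integers (a real protocol would work modulo a large number and use cryptographically secure randomness); the commodity server only generates correlated randomness offline and sees none of the private data:

```python
import random

def dot(u, v):
    """Plain scalar product, used here both inside the protocol and to check it."""
    return sum(a * b for a, b in zip(u, v))

def commodity_server(n, rng=random):
    """Offline phase: the commodity server hands Alice (Ra, ra) and
    Bob (Rb, rb) such that ra + rb = Ra . Rb, then plays no further part."""
    Ra = [rng.randrange(1000) for _ in range(n)]
    Rb = [rng.randrange(1000) for _ in range(n)]
    ra = rng.randrange(1000)
    rb = dot(Ra, Rb) - ra
    return (Ra, ra), (Rb, rb)

def shared_scalar_product(X, Y, rng=random):
    """Online phase: Alice ends with s_A, Bob with s_B, s_A + s_B = X . Y,
    while each party only ever sees a masked version of the other's vector."""
    (Ra, ra), (Rb, rb) = commodity_server(len(X), rng)
    X_hat = [x + r for x, r in zip(X, Ra)]        # Alice -> Bob: masked X
    s_B = rng.randrange(1000)                     # Bob picks his own share
    Y_hat = [y + r for y, r in zip(Y, Rb)]        # Bob -> Alice: masked Y
    T = dot(X_hat, Y) + rb - s_B                  # Bob -> Alice: masked result
    s_A = T - dot(Y_hat, Ra) + ra                 # Alice unmasks her share
    return s_A, s_B

X, Y = [1, 2, 3], [4, 5, 6]
s_A, s_B = shared_scalar_product(X, Y)
```

The correctness follows from the commodity relation: s_A = X·Y + (ra + rb − Ra·Rb) − s_B = X·Y − s_B.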
2.3.3.3 Secure Linear Regression Analysis
Regression analysis is often used to identify patterns, and/or make predictions based
on available datasets. For example, a French diabetic research institute and a British
diabetic centre would like to find whether there is any correlation between specific medical
conditions and factors such as education, sex, age, income and career across the two
sets of patients under their respective management, using regression analysis.
In addition, they would also like to identify future trends in some of these factors in
order to take preventive measures against diabetes and other related diseases. Owing
to privacy concerns and legal responsibilities, they cannot reveal their patients’
records to the other party. In this case, privacy-preserving regression methods can be
used to find hidden trends and other valuable information related to this problem.
The design of a privacy-preserving regression protocol largely makes use of secure
matrix operation protocol and secure scalar product protocol. To date, a number of
such protocols have been proposed [DU’01b, DU’01d, LUO’03, DU’04, KARR’04,
REIT’04, KARR’05a, KARR’05b, KARR’06, KARR’09a]. The most notable ones
are the secure simple linear regression protocols [LUO’03, DU’01b, DU’01d,
KARR’06], secure multivariate regression protocol [DU’04] and secure regression
protocols for vertically partitioned and partially overlapped data [KARR’04, REIT’04,
KARR’09a]. Among these works, the secure simple linear regression protocol
designed by Du et al. [DU’01d] uses the secure scalar product protocol to address the
linear regression inference in a two-party homogeneous data model. The secure
multivariate linear regression protocol proposed in [DU’04] uses the matrix product
protocol and the commodity-server solution. Built on the secure matrix protocol,
these solutions can securely compute matrix inverse, matrix determinant and norm.
Furthermore, the paper [DU’04] allows two parties to evaluate more complicated
mathematical expressions than merely computing the matrix product or dot product
operations. Karr et al. [KARR’05a] proposed a framework to address the secure linear
regression problem in a cooperative environment in which protecting the sources of
data records is the primary concern. The framework was developed by using local
computation and secure summation protocols. With this framework, the computational
process is as follows. First, it uses the additivity property of the linear regression
model to compute the regression coefficients. Second, the framework
supports the use of two approaches. The first approach applies local computation and
secure summation to compute and exploit additivity of several statistics to diagnose
the corresponding statistical model (secure data integration model and securely shared
local statistics model) [KARR’05a]. The second approach generates synthetic
residuals to perform model diagnostics while preserving the relationships among
predictors and residuals. More specifically, the solutions to secure two-party
computation of simple regression analysis were designed by employing scalar product
techniques [DU’01a, DU’01b, DU’01c, LUO’03, KARR’05b]. In [DU’01a, DU’01b],
the computation of variance and correlation coefficient were discussed regarding both
the heterogeneous data model and the homogeneous data model. The Secure Two-party
Statistical Analysis Protocol in Heterogeneous Model securely computes the
correlation coefficient r and the slope of the simple linear regression line b, where
Alice has a dataset D_1 = (x_1, ..., x_n) and Bob has a dataset D_2 = (y_1, ..., y_n), while the
Division Protocol and the Secure Two-party Statistical Analysis Protocol in
Homogeneous Model compute r and b, where Alice has dataset
D_1 = ((x_1, y_1), ..., (x_k, y_k)) and Bob has dataset D_2 = ((x_{k+1}, y_{k+1}), ..., (x_n, y_n)). Luo
[LUO’03] further extended Du’s protocols [DU’01d, DU’04] to address the secure k-
party regression problem in a homogeneous model using secure real product and
division protocols. A solution was designed to securely compute x̄, ȳ, r and b,
where party 1 has D_1 = ((x_1, y_1), ..., (x_{s_1}, y_{s_1})), party 2 has
D_2 = ((x_{s_1+1}, y_{s_1+1}), ..., (x_{s_2}, y_{s_2})), ..., and party k has
D_k = ((x_{s_{k-1}+1}, y_{s_{k-1}+1}), ..., (x_{s_k}, y_{s_k})). A
similar solution was found in [KARR’05b] to perform secure statistical analysis on
distributed chemical databases, which has demonstrated a real-life application of this
solution.
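Most of these protocols reduce to a secure scalar (dot) product step. As a rough illustration of the commodity-server idea mentioned above, the following sketch (our own simplified version, not the exact protocol of [DU’04]; all names and the mask ranges are our choices) lets Alice and Bob obtain additive shares u and v with u + v = x·y, while each vector stays masked:

```python
import numpy as np

def commodity_server(n, rng):
    # The semi-trusted server only generates correlated randomness:
    # masks Ra, Rb and scalars ra, rb with ra + rb = Ra . Rb.
    Ra = rng.integers(0, 100, n)
    Rb = rng.integers(0, 100, n)
    ra = int(rng.integers(0, 100))
    rb = int(Ra @ Rb) - ra
    return (Ra, ra), (Rb, rb)

def secure_dot(x, y, rng):
    """Alice holds x, Bob holds y; they end with shares u + v = x . y."""
    (Ra, ra), (Rb, rb) = commodity_server(len(x), rng)
    x_masked = x + Ra                    # Alice -> Bob: x hidden by Ra
    y_masked = y + Rb                    # Bob -> Alice: y hidden by Rb
    v = int(rng.integers(0, 100))        # Bob's random output share
    t = int(x_masked @ y) + rb - v       # Bob -> Alice
    u = t - int(Ra @ y_masked) + ra      # algebra cancels the masks: u = x.y - v
    return u, v

rng = np.random.default_rng(0)
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
u, v = secure_dot(x, y, rng)
# u + v reconstructs the scalar product 1*4 + 2*5 + 3*6 = 32
```

Neither party sees the other's raw vector: Bob only observes x + Ra, and Alice only observes y + Rb together with a value offset by Bob's random share v.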
Solutions to multivariate linear regression analysis were studied and presented in
[DU’04, REIT’04, KARR’06]. In [DU’04], the authors consider the joint dataset
M as an N x (n+m+1) matrix, where

M = [ x_{1,1} ... x_{1,n+m}  y_1
      ...
      x_{N,1} ... x_{N,n+m}  y_N ].

Each column of M represents an attribute, so there are n+m+1 attributes in
M. Each row of M represents a data subject’s values of these attributes. It is further
defined that

X = [ x_{1,1} ... x_{1,n+m}
      ...
      x_{N,1} ... x_{N,n+m} ]   and   Y = (y_1, ..., y_N)^T.

The purpose of this computation is to find the regression coefficients β of Y = Xβ that
best fit M. [DU’04] considers a two-party, vertically partitioned data model, where
Alice has

A = [ x_{1,1} ... x_{1,n}
      ...
      x_{N,1} ... x_{N,n} ],   and Bob has   B = [ x_{1,n+1} ... x_{1,n+m}
                                                   ...
                                                   x_{N,n+1} ... x_{N,n+m} ]. Several
protocols were proposed to support intermediate computational tasks in a multivariate
regression computation while preserving the privacy for A and B . The techniques of
matrix product computation, inverse matrix computation and random noise generation
were used for hiding A and B . The intermediate computation results were also
disguised by random noise, so that they cannot be used to infer useful information.
The authors also demonstrate that this secure multivariate regression solution can be
applied to build multivariate classification model without disclosing the raw data.
An interesting extension of the secure multivariate regression computation can be
found in [REIT’04]. This paper considers a special data partitioning model: the
vertically partitioned, partially overlapping data, where multiple parties hold datasets
for different variables; some of the variables, however, are overlapped. These parties
would like to find the regression function that best fits their joint dataset, but none of
them is willing to share their raw data with the other parties. This paper addresses
problems for datasets that follow multivariate normal and multinomial distributions. The
expectation maximization algorithm was used as a building block in this paper. The
parties only compute and share the sufficient statistics required for the building block,
rather than sharing individual data values of their respective dataset. A framework for
secure linear regression and statistical analyses in a cooperative environment was
proposed in [KARR’06]. It summarized the research results regarding secure linear
regression related computation problems under various forms of data partitioning
models.
2.4 Identification of the Research Gap
From the above literature review and related-work analysis, we have observed that,
despite the growing need for secure computation of a range of
statistical algorithms, only a subset of these algorithms has actually been
converted into privacy-preserving solutions, i.e. the statistical disclosure, the matrix
product computation and the linear regression analysis. Most of the privacy-preserving
statistical computation problems are yet to be explored. Hypothesis tests,
factor analyses and nonlinear regressions are examples of the statistical algorithms
that find applications in privacy-preserving collaborative computation environments,
but have yet to be transformed into privacy-preserving protocols. To address this gap,
this thesis, as our initial effort, investigates security methods in search of an optimum
solution to privacy-preserving distributed statistical computation problems. The
nonparametric sign test (NST) is chosen as a case for study. Through the
demonstration of how to convert a nonparametric hypothesis test (NST) algorithm
into a privacy-preserving protocol that could support privacy-preserving distributed
statistical computation in a cost-efficient manner, a four-phase methodology has been
developed. This methodology can be used to transform any other statistical
computation into a privacy-preserving one.
Most importantly, to transform a conventional computational algorithm into a secure
and cost-efficient protocol, the algorithm will first be examined and divided into local
and global computation tasks. The results from a local computational task or a global
computation task are so-called intermediate computational results that will need
privacy protection. A global computational task can lead to the identification of
interactions among multiple parties. The messages used in these interactions can be
intermediate computational results and/or original input from the sender. The original
input may contain raw data of the original dataset and the intermediate computational
results may contain aggregated information of the joint dataset.
How to protect the privacy of these original input data, intermediate computational
results and/or interaction messages is a challenging task which is dependent on the
nature and characteristics of the computation algorithms. In addition, there are a wide
range of privacy-preserving building blocks that can be used to preserve data privacy.
These building blocks may have different security capabilities and impose
corresponding storage and computational requirements. Which set of building blocks
should be chosen and how to apply them to best achieve privacy-preserving
collaborative computation in a cost efficient manner is the focus of this research. For
proof of this concept, the NST algorithm is chosen as the case for study. The ultimate
goal of this research is to find an optimum solution to the PPDSC problems.
2.5 The Best Way Forward
A systematic approach has been employed in the design of our solutions to privacy-
preserving nonparametric sign test (P2NST) computation. This approach can be
divided into the following steps:
- To investigate research gaps in the literature.
- To investigate data privacy definitions in the literature, so that the properties of
sensitive information, and why this information needs to be protected, can be
understood.
- To define data privacy in the P2NST computation.
- To investigate privacy-preserving primitives in the literature, so that their
properties and capabilities can be learnt.
- To apply the most appropriate privacy-preserving primitives in the design of
our P2NST solution, in search of an optimum cost-efficient solution.
- To implement and evaluate the designed solution.
- To conclude the lessons learnt from this research.
2.6 Chapter Summary
This chapter has presented an extensive overview of state-of-the-art methods in
privacy-preserving distributed computation. The terminologies and definitions that
are used in the field of PPDDC were first introduced and defined. Following an
investigation into the privacy concerns in both legal and research areas, the privacy-
preserving solutions proposed in related works in the literature were
studied. This began with SMC solutions, which are the origin of this research topic,
followed by solutions in the area of PPDM, a newer research area that
applies secure multiparty computation techniques to knowledge discovery.
Finally, the PPDSC solutions were studied. From these reviews, the gap in current
research has been identified. This chapter then outlined our ideas for designing
solutions to the problem identified, the next chapter provides the design preliminaries
and evaluation method of this research work while chapter 4 presents the privacy-
preserving building blocks to be used in the design of our solutions.
Chapter 3 Design Preliminaries and Evaluation Method
3.1 Chapter Introduction
This chapter outlines the design preliminaries and evaluation method of the solutions
presented in this thesis. In detail, the data privacy definitions used in the design of our
solutions are described in section 3.2. Section 3.3 provides the decomposition and
analysis of the original nonparametric sign test (NST) algorithm. Section 3.4 specifies
the design requirements for the privacy-preserving solutions. The evaluation strategy
of the designed solutions is drawn up in section 3.5 while section 3.6 details the
simulation method. Finally, section 3.7 summarizes this chapter.
3.2 Definition of Data Privacy
Based on the privacy definitions in [CLIF’02b, GOLD’04], this thesis makes use of a
three-layer privacy definition: individual data confidentiality, individual privacy and
corporate privacy.
Definition 3.1: Individual Data Confidentiality
Individual data confidentiality refers to preserving the secrecy of an individual data
item so as to ensure that the data item is only accessible to the data provider that
manages the data item.
Definition 3.2: Individual Privacy
Individual privacy refers to the privacy of a data subject. Let us assume that data
items are collected from a data owner (i.e. a subject) by two or more data (or service)
providers (e.g. Alice and Bob). Preserving individual privacy means that, upon a
successful computation of a function f among multiple data providers (e.g. Alice and
Bob), information in relation to the subject should not be disclosed to parties other
than the managing party of the subject, i.e. Alice or Bob.
The meaning of keeping a data item private is twofold. Firstly, it means preserving
the confidentiality of the data item. Secondly, the data subject with whom the data item
is associated should not be identifiable from the data item. In other words, during a
computational process, the data item should be protected from access by any party
other than the managing party itself, and, in the event of a protected data item being
revealed, it cannot be used to identify the owner/data subject of the data item. More
formally, assume that s_i = {s_{i,1}, s_{i,2}, ..., s_{i,k}}, where s_i represents subject i’s data
itemset consisting of k different items (or objects) and s_{i,j} is subject i’s j-th data
item, for j = 1, 2, ..., k. To preserve data privacy, the real value of s_{i,j} should
be prevented from being revealed (this preserves the data confidentiality of a data
item), and furthermore, even if the confidentiality of s_{i,j} is compromised, this will not
be sufficient for a perpetrator to learn that the owner of this data is subject i (this
preserves the individual privacy in relation to a data item).
Definition 3.3: Corporate Privacy
Corporate privacy refers to the privacy of data providers’ local statistics, which
contain various levels of aggregated information of the dataset. That is, upon the
computation of a function f, information in relation to a participating party’s local
statistics, for example the mean value or the median of {s_{i,1}, s_{i,2}, ..., s_{i,k}}, should not be
revealed to any party other than the contributor itself.
Quantification of Data Privacy
As the solutions designed in this thesis are developed based on data perturbation
techniques, such as data randomization, data swapping and data transformation
techniques, we use the metric proposed in [AGRA’01] to quantify the security level of
our solutions. This metric is designed based on the work of [AGRA’00] and
Shannon’s information entropy theory [SHAN’48].
Shannon defines the concept of entropy as follows. Let X be a random variable
which takes on a finite set of values according to a probability mass function p(x).
Then the entropy of the random variable X is defined, in the discrete case, as
h(X) = -Σ_x p(x) log_2 p(x), and in the continuous case as h(X) = -∫ f(x) log_2 f(x) dx,
where f(x) denotes the probability density function (pdf) of the continuous random
variable X. It has been widely accepted in the literature that h(X) is a measure of the
uncertainty in the value of X [AGRA’01]. For example, for a random variable U
uniformly distributed between 0 and a, the entropy is h(U) = log_2(a). Since entropy
represents the information content of a datum, the entropy after data sanitization
should be higher than the entropy before the sanitization.
The authors of [AGRA’01] use this property of entropy as a measure of uncertainty to
measure the level of privacy. It defines that, given a random variable A with
probability density function f_A(a), the differential entropy h(A) of A is described
as h(A) = -∫_{Ω_A} f_A(a) log_2 f_A(a) da, where Ω_A is the domain of A. The
authors further propose a measure of privacy of the random variable A as
Π(A) = 2^{h(A)}; for example, for a uniform random variable on [0, a],
h(A) = log_2(a) and Π(A) = 2^{log_2(a)} = a. Here Π(A) denotes the length of an
interval over which a uniformly distributed random variable has the same uncertainty
as A. This quantifies the level of privacy by means of its uncertainty.
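As a minimal numerical sketch of this metric (our own illustration), for a uniform random variable on (0, a) the differential entropy is log_2(a), and the privacy level Π(A) = 2^{h(A)} recovers the interval length:

```python
import numpy as np

def differential_entropy_uniform(a):
    # For U ~ Uniform(0, a): f(x) = 1/a, so h(U) = -∫ (1/a) log2(1/a) dx = log2(a)
    return np.log2(a)

def privacy_level(h):
    # Pi(A) = 2^h(A): the length of the interval over which a uniformly
    # distributed random variable has the same uncertainty as A
    return 2.0 ** h

h = differential_entropy_uniform(8.0)   # h = log2(8) = 3 bits
pi = privacy_level(h)                   # Pi = 2^3 = 8, the interval length
```

Widening the perturbation interval thus raises both the entropy and the measured privacy level.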
3.3 The NST Computation
In this section, the NST computation is used as an exemplar algorithm to demonstrate
how to transform a statistical algorithm into a distributed privacy-preserving one such
that multiple mutually distrustful parties can compute the algorithm on their joint
dataset without compromising their respective privacy.
The NST algorithm will firstly be described and then transformed into one that can be
executed in an ideal Trusted Third Party (TTP) based model (hereafter called the
TTP-NST algorithm). Through this transformation process, it is possible to distinguish
the computational elements (or segments) that can be performed by an individual
party itself, without any input from or involvement of other parties (the so-called
local computation), from the elements that require inputs from, or the participation
of, other parties (the so-called global computation). In addition, all
intermediate results (i.e. those generated from earlier stages of a computation and
will be used to compute the next intermediate result or the final result) will also be
identified and singled out from the final result (that is the final outcome of the
computation).
Based upon the nature of these local/global computation tasks and intermediate/final
computational results, we can then investigate, identify or design the most appropriate
privacy-preserving techniques or protocols and use them to support the NST
computation without the involvement of the TTP. This can be achieved while
simultaneously preserving the privacy of participating parties’ respective inputs and
intermediate computational results. In other words, our main task is to seek and
design privacy-preserving primitives and protocols and use them to replace the
involvement of the TTP so as to transform the TTP-NST algorithm into a privacy-
preserving distributed NST algorithm (here we call it P2DNST). By using the
P2DNST algorithm, the parties should learn no more than if they were using the TTP-
NST model.
Two data partitioning models are considered in the protocol design, i.e. the vertically
partitioned (heterogeneous) data model and the horizontally partitioned
(homogeneous) data model. In the vertically partitioned data model, Alice has
X = {x_1, ..., x_n} and Bob has Y = {y_1, ..., y_n}. The sign test computation using this data
partitioning model involves a pairwise comparison for each (x_i, y_i). Alice and Bob
will need to send their respective private datasets into the computation in order to
compute the sign test result. During the course of the sign test computation,
X = {x_1, ..., x_n} and Y = {y_1, ..., y_n} are first compared pairwise. A set of
intermediate results is then generated, i.e. (P, Q, R), where P is the number count of
x_i > y_i, Q is the number count of x_i = y_i and R is the number count of x_i < y_i.
(P, Q, R) is then used to compute the sign test result.
On the other hand, in a horizontally partitioned data model, Alice holds a dataset
A = {(x_1, y_1), ..., (x_{n_1}, y_{n_1})} of n_1 data subjects and Bob holds a dataset
B = {(x_1, y_1), ..., (x_{n_2}, y_{n_2})} of n_2 data subjects. To compute the sign test, Alice can
calculate P_A, Q_A and R_A from A, where P_A is the number count of x_i > y_i, Q_A is
the number count of x_i = y_i and R_A is the number count of x_i < y_i, for i = 1, ..., n_1.
Likewise, Bob can compute P_B, Q_B and R_B from B, where P_B is the number count
of x_i > y_i, Q_B is the number count of x_i = y_i and R_B is the number count of x_i < y_i,
for i = 1, ..., n_2. In this data model, Alice and Bob only need to send (P_A, Q_A, R_A)
and (P_B, Q_B, R_B), respectively, to the computation in order to perform the
collaborative sign test computation. This computation is simpler than that of the
vertically partitioned data model, as it only needs the intermediate results from Alice and
Bob, respectively. These intermediate results do not contain any private information
regarding individual data confidentiality and individual privacy, and its solution can
easily be deduced from the solution for the vertically partitioned data model.
This thesis focuses on the design of solutions for the vertically partitioned data model;
the horizontally partitioned model case will be left as our future work.
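Because the counts are additive across partitions, the horizontal-model computation can be sketched in a few lines (the data values below are invented purely for illustration):

```python
def local_counts(pairs):
    # (P, Q, R): counts of x > y, x == y and x < y within one party's partition
    P = sum(1 for x, y in pairs if x > y)
    Q = sum(1 for x, y in pairs if x == y)
    R = sum(1 for x, y in pairs if x < y)
    return P, Q, R

alice = [(3, 1), (2, 2), (0, 5)]   # n_1 = 3 subjects held by Alice
bob = [(7, 4), (1, 6)]             # n_2 = 2 subjects held by Bob

PA, QA, RA = local_counts(alice)
PB, QB, RB = local_counts(bob)
# Only the aggregates are combined: (P, Q, R) = (PA + PB, QA + QB, RA + RB)
P, Q, R = PA + PB, QA + QB, RA + RB
```

No individual pair (x_i, y_i) ever leaves its owner; only the three local aggregates are exchanged.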
3.3.1 The NST Computation Problem
Assume that Alice has a dataset X = {x_1, x_2, ..., x_n} and Bob has a dataset
Y = {y_1, y_2, ..., y_n}, and that X and Y are dependent, paired datasets, where x_i and y_i
are both generated from subject i, for i = 1, ..., n. To start with, we also assume each
subject generates only a single data pair, contributed to Alice and Bob respectively.
In this case, X and Y are vertically partitioned (heterogeneous) datasets. Both
Alice and Bob want to know if there is any difference between X = {x_1, x_2, ..., x_n} and
Y = {y_1, y_2, ..., y_n} by performing the NST under the α-significance level. Formally,
this is a hypothesis test problem with respect to a null hypothesis H_0, for example,
H_0: the two population distributions are identical and Pr(x_i > y_i) = 0.5 for any pair;
versus an alternative hypothesis H_1, for example, H_1: the two population
distributions are not identical and Pr(x_i > y_i) ≠ 0.5 for any pair, under the α-
significance level.
3.3.2 The TTP-NST Algorithm
As neither Alice nor Bob is willing to reveal any value of their respective datasets to
the other party during the computation, they would like to perform this computation
with privacy protection. One way of achieving this is through the use of a TTP. In this
TTP based model, Alice and Bob send their respective datasets to the TTP, and the
TTP uses the data to perform the computation, and then sends the outcome of the
computation to Alice and Bob, respectively. As the TTP is the centre of the
computation, the level of privacy protection afforded in this model lies in the
trustworthiness of the TTP. If the TTP is fully trustworthy, and if it only delivers the
final result to the parties, it is clear that nobody could learn anything that is not
inferable from its own input and the final result, assuming that such a TTP does exist.
The operation of this computation is shown in Figure 12 and the computation process
is expressed in pseudo code in Figure 13.
Figure 12. The TTP-NST computation. (Source: Author’s own)
TTP-Based Nonparametric Sign Test (TTP-NST) Algorithm
Input
I-1) An α value is negotiated by Alice and Bob for performing an α-level sign test.
I-2) A sample size criterion Z = 25, also negotiated by Alice and Bob according to the
central limit theorem [ROSE’97].
I-3) A null hypothesis H_0 is negotiated by Alice and Bob.
I-4) Alice’s dataset X = {x_1, x_2, ..., x_n}.
I-5) Bob’s dataset Y = {y_1, y_2, ..., y_n}.
Output
O-1) TTP computes the test result: “Reject H_0” or “Do not reject H_0”.
O-2) TTP sends the test result to Alice.
O-3) TTP sends the test result to Bob.
% Execution
-- Stage_1 Alice and Bob send their input, including their respective datasets, to the
TTP. --
(1) Alice sends {α, z, x_1, x_2, ..., x_n} to the TTP. // α is the significance level, z is the
sample size criterion and x_i is the input contributed by subject i, for i = 1, ..., n.
(2) Bob sends {α, z, y_1, y_2, ..., y_n} to the TTP. // α is the significance level, z is the
sample size criterion and y_i is the input contributed by subject i, for i = 1, ..., n.
-- Stage_2 TTP calculates intermediate results P, Q, R and N. --
(3) For i = 1 to n, TTP compares x_i and y_i:
if x_i > y_i, then TTP generates {P_i, Q_i, R_i} = {1, 0, 0};
else if x_i = y_i, then TTP generates {P_i, Q_i, R_i} = {0, 1, 0};
else if x_i < y_i, then TTP generates {P_i, Q_i, R_i} = {0, 0, 1};
end if,
end.
(4) TTP calculates P = Σ_{i=1}^{n} P_i. // P is the number count of positive signs.
(5) TTP calculates Q = Σ_{i=1}^{n} Q_i. // Q is the number count of equal values.
(6) TTP calculates R = Σ_{i=1}^{n} R_i. // R is the number count of negative signs.
(7) TTP calculates N = n - Q. // N is the sum of the number counts of positive signs and
negative signs.
-- Stage_3 TTP calculates intermediate results T_0, CR, T_1 and T_2. --
(8) TTP calculates T_0 = min{P, R}:
if P > R, then let T_0 = R;
else if P = R, then let T_0 = P = R;
else P < R, let the test statistic T_0 = P.
(9) TTP calculates the critical value CR according to N, Z and α:
if N < Z, then CR = α/2;
else N ≥ Z, then find CR from the Standard Normal Distribution Table according to N
and α.
(10) TTP calculates the test statistic T:
if N < Z, then let the test statistic
T = T_1 = Σ_{i=0}^{T_0} (1/2)^N · N!/(i!(N-i)!) = Σ_{i=0}^{T_0} (1/2)^N · C(N, i);
else N ≥ Z, let the test statistic
T = T_2 = (T_0 + 0.5 - N/2) / (√N / 2).
-- Stage_4 TTP performs the sign test computation. --
(11) TTP compares T and CR:
if T < CR, then reject H_0;
else do not reject H_0.
-- Stage_5 TTP transmits the final computational result to Alice and Bob, respectively.
--
(12) TTP sends the test result to Alice.
(13) TTP sends the test result to Bob.
(14) End of this computation.
Figure 13. The TTP-NST algorithm. (Source: Author’s own)
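The five stages of the TTP-NST algorithm can be prototyped directly as plain (non-private) code. The sketch below is ours, in Python rather than the thesis's MATLAB; the large-sample critical value is passed in as a normal-table lookup, and the rejection rule T < CR follows our reading of step (11):

```python
from math import comb, sqrt

def ttp_nst(xs, ys, alpha=0.05, z=25, cr_normal=-1.96):
    """Plain sign test following the five TTP-NST stages.

    cr_normal: lower-tail critical value from the standard normal table
    for the chosen alpha (here -1.96 for alpha = 0.05)."""
    # Stage 2: pairwise comparison and the counts P, Q, R, N
    P = sum(1 for x, y in zip(xs, ys) if x > y)
    Q = sum(1 for x, y in zip(xs, ys) if x == y)
    R = sum(1 for x, y in zip(xs, ys) if x < y)
    N = len(xs) - Q                      # ties are discarded
    # Stage 3: T0 = min(P, R), then the test statistic and critical value
    T0 = min(P, R)
    if N < z:
        # small sample: exact binomial tail, compared against alpha/2
        T = sum(comb(N, i) * 0.5 ** N for i in range(T0 + 1))
        CR = alpha / 2
    else:
        # large sample: normal approximation with continuity correction
        T = (T0 + 0.5 - N / 2) / (sqrt(N) / 2)
        CR = cr_normal
    # Stage 4: decision rule - reject H0 when T falls below CR
    return {"P": P, "Q": Q, "R": R, "N": N, "T0": T0,
            "T": T, "CR": CR, "reject": T < CR}

result = ttp_nst([1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7])
# All six pairs have x < y: P = 0, R = 6, T0 = 0,
# T = C(6,0) * 0.5**6 = 0.015625 < 0.025, so H0 is rejected
```

In the privacy-preserving versions designed later, every intermediate value computed here (P, Q, R, N, T_0, T) becomes a quantity to protect.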
As shown in Figure 13, the TTP-NST algorithm can largely be divided into five
computational stages:
Stage_1) TTP gathers inputs {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_n} from Alice and Bob,
respectively.
Stage_2) TTP performs the comparison for each (x_i, y_i) and generates the Stage_2
intermediate results P, Q, R and N.
Stage_3) TTP generates the Stage_3 intermediate results T_0, CR, T_1 and T_2, using the
intermediate results of Stage_2.
Stage_4) TTP performs the sign test according to the intermediate results of Stage_3
and obtains the test result (final result).
Stage_5) TTP sends the final result to Alice and Bob.
These five stages enable us to specify the design requirements, which can be found in
the next section.
3.4 Design Requirements
From the TTP-NST algorithm decomposition, it can be seen that data privacy
protection in this computation relies heavily on the trustworthiness of the TTP, which
has access to the datasets of both Alice and Bob. If the TTP exposes any
information in relation to the dataset of one party (Alice/Bob) to the other party
(Bob/Alice), the concerned party’s data privacy would be compromised. In addition,
the TTP also knows the intermediate results of the computation, which, if disclosed,
may be used by either party to infer useful information about its counterpart’s
input. A detailed analysis is given below.
In order to design a solution that does not need a TTP, the following four
requirements should be carefully addressed.
Requirement 1: To Protect the Privacy of Input Data Items
According to the assumptions of the sign test, Alice has X = {x_1, x_2, ..., x_n}, Bob has
Y = {y_1, y_2, ..., y_n} and (x_i, y_i) comes from subject i, for i = 1, 2, ..., n. The first
objective of this computation is to protect {x_1, x_2, ..., x_n} from Bob and protect
{y_1, y_2, ..., y_n} from Alice during the computation. Therefore, we need privacy-
preserving techniques here, to support two-party privacy-preserving (or secure)
comparison of (x_i, y_i), for i = 1, 2, ..., n. After the comparison, Alice only knows
x_i and the comparison result; Bob only knows y_i and the comparison result.
Requirement 2: To Protect the Difference Between x_i and y_i
After the pairwise comparison, if the value of (x_i - y_i) is known by either Alice or
Bob, the value of y_i or x_i can be calculated. This breaches the individual data
confidentiality of x_i or y_i, and the individual privacy of subject i. This information can be
further used to infer other data items in X or Y. The more data items are inferred, the
more accurately the aggregated information can be estimated.
Requirement 3: To Protect the Privacy of Intermediate Results
As discussed above, during the process of the computation, several intermediate
results are generated, which may be used to infer information about {x_1, x_2, ..., x_n}
and {y_1, y_2, ..., y_n}. For example, in the TTP-NST algorithm, the intermediate results
are: P, the number count for x_i > y_i; Q, the number count for x_i = y_i; R, the
number count for x_i < y_i; N, the number count for x_i ≠ y_i, where N = n - Q = P + R;
CR, the critical value of this test; T_0, the intermediate result that equals min{P, R};
T_1, the test statistic if this is a small-sample sign test; and T_2, the test statistic if this is a
large-sample sign test. All these intermediate results should be protected. For example, if
Alice gets P, she may infer the value of Q + R; if P or Q + R approximates
n, Alice might infer the distribution of the joint dataset. If Alice gets both P and Q,
then she can not only perform the test by herself, but also derive the characteristics of
Bob’s dataset, which compromises the corporate privacy of Bob’s dataset.
Furthermore, in some extreme cases, if a significant amount of information coexists in
the joint datasets, individual privacy and corporate privacy may both be compromised;
for example, when outliers exist in the joint dataset, or when the dataset provided by
Alice is centred on a specific point. It is thus also necessary to secure these
intermediate results using cryptographic means.
By examining the TTP-NST algorithm, it can be seen that the issue of preserving the
privacy of intermediate results affects three stages of the computation: Stage_2,
Stage_3 and Stage_4. As the intermediate results from Stage_2 will be fed into the
computation performed in Stage_3, and the intermediate results from Stage_3 will be
fed into the computation at Stage_4, we need to examine how these intermediate
results can be protected while rendering the computation in the successive stage
possible. For example, four intermediate results are generated in Stage_2: P, Q, R
and N. Because Q contains information equivalent to N (i.e. N = n - Q), this
information should be used at Stage_3 to determine the sample size and to find the
CR value. In addition, the values of P and R should also be used to generate the
intermediate result T_0. We need to apply a secure random permutation function to
protect the values of P and R, so that even if Alice knows either of the values, she
still needs to spend concerted effort on guessing which value is P and which is R, and the
probability of her correctly guessing the value of P or R is 0.5, respectively.
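The permutation idea can be illustrated with a toy coin flip (our own minimal sketch, not the full secure protocol used later): the pair (P, R) is swapped under a secret bit, so an observer of the shuffled pair identifies which value is P with probability only 0.5:

```python
import random

def permute_counts(P, R, secret_bit):
    # Swap the pair when the secret bit is 1; a viewer of the shuffled pair
    # who lacks the bit can only guess which value is P, succeeding half the time
    return (R, P) if secret_bit else (P, R)

rng = random.Random(7)
bit = rng.getrandbits(1)          # known only to the permuting party
a, b = permute_counts(12, 30, bit)
# The multiset {12, 30} is preserved, but the P/R labels are hidden
```

The shuffled pair still supports the later computation of T_0 = min{P, R}, since the minimum is invariant under the swap.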
Requirement 4: To Identify Privacy-preserving Techniques
Privacy-preserving techniques should be selected so that the aforementioned security
issues can be appropriately addressed. Once these have been identified, the design of
the P2DNST solution can be commenced.
In addition, four other parameters are also used in the computation: z, α, n
and CR, where z is the sample size criterion (negotiated by Alice and Bob prior to
the computation); α is the significance level (also negotiated by Alice and Bob prior
to the computation); n is the number of data subjects; and CR is the critical value for the
test, defined by z, N and α. The privacy issues in relation to these four
parameters can be discussed in terms of individual privacy and corporate privacy.
Firstly, disclosing the four parameters does not breach the individual privacy of any
subject or the data inputs. Secondly, the only corporate privacy breached by the four
parameters is the value of P + R (= N), as N is needed to find CR. However,
knowing P + R cannot help one party to infer information about the other party’s
dataset, nor can it be used to infer aggregate information about the other party. In other
words, disclosing P + R is acceptable during the execution of the algorithm, in order
to achieve the desired level of privacy preservation.
3.5 Evaluation Strategy
The performance of the designed solutions will be compared with the TTP-NST
algorithm. To do so, the designed solutions will be evaluated with respect to the
following metrics: the correctness, the security level, the computation overhead, the
communication overhead and the protocol execution time. These metrics are
explained below.
3.5.1 Correctness
To ensure that the final computation results of the two privacy-preserving solutions
are exactly the same as those of the TTP-NST algorithm, the correctness of the designed
solutions will be verified. The verification will be conducted by analysing the key
intermediate computation tasks throughout the computation. As both protocol suites
are designed to be capable of performing computation on disguised or encrypted
datasets, it will be shown that the privacy-preserving effects imposed on the
intermediate computation results are removed by later intermediate computation
tasks, without affecting the final computation result. For the first protocol suite, i.e.
the P2NSTP, the final result of the privacy-preserving computation,
which employs data perturbation techniques, will be the same as that of the TTP-NST
algorithm. For the second protocol suite, i.e. the P2NSTC, the final result
of the privacy-preserving computation, which employs data perturbation
techniques and an additively homomorphic cryptosystem, will be the same as that of the
TTP-NST algorithm.
3.5.2 Level of Security
Two solutions have been designed in this research work, i.e. the P2NSTP protocol
suite and the P2NSTC protocol suite. The P2NSTP uses data perturbation techniques
to preserve data privacy, while the P2NSTC employs an additively homomorphic
encryption scheme in addition to data perturbation techniques. As the level of privacy
protection provided by the P2NSTP is based on the data uncertainty introduced by the data
perturbation techniques, its level of security will be measured by entropy. The
P2NSTC uses both data perturbation techniques and an additively homomorphic
encryption scheme to protect data privacy; the security level of this second solution
depends on the length of the public/private key pair.
3.5.3 Computational Overhead
The computational overhead is estimated by counting the types of computation a
protocol execution requires. The computations performed in a protocol execution can
be classified into non-privacy-preserving computation and privacy-preserving
computation. Privacy-preserving computation can be further classified into data
disguising computation (relatively cheap) and cryptographic computation
(relatively much more expensive); cryptographic computation includes encryption
and decryption operations. The computational overhead of a solution is thus estimated by
counting the numbers of normal computations and cryptographic computations it
takes: the type of every computational task will be classified and the
number of tasks of each type will be counted, so as to compare the computational
overhead of each solution.
3.5.4 Communication Overhead
The communication overhead is estimated in terms of two metrics: (1) the number of
messages in a protocol execution and (2) the size of each message. Assuming that
msg_i represents a message, size(msg_i) is the size of msg_i and there are n messages
in a protocol execution, the total communication overhead for the protocol is
calculated as

C.O. = Σ_{i=1}^{n} size(msg_i).
3.5.5 Execution Time
Execution time is the time from the start of the computation until the end of the
computation, i.e. when both Alice and Bob have received the final computation result.
The execution time will also be used to estimate the protocol efficiency, which is
defined as

E.P. = Execution Time of Protocol / Size of the dataset.

The E.P. value is the average computation time for a single data input under a given
security level.
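The two metrics above can be sketched directly. The following Python fragment (the thesis prototypes are MATLAB; Python is used here purely for illustration, and the message sizes and timing are made-up values) computes both the communication overhead and the protocol efficiency:

```python
# Illustrative computation of the two evaluation metrics defined above.
# Message sizes (in bytes) and timings are made-up example values.

def communication_overhead(message_sizes):
    """C.O. = sum of size(msg_i) over all n messages in a protocol run."""
    return sum(message_sizes)

def protocol_efficiency(execution_time, dataset_size):
    """E.P. = execution time / dataset size: average computation time
    per data input at a given security level."""
    return execution_time / dataset_size

sizes = [1024, 4096, 2048, 16]          # e.g. four messages, msg1..msg4
assert communication_overhead(sizes) == 7184
assert protocol_efficiency(3.0, 1000) == 0.003   # 3 s over 1000 data items
```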
3.6 Simulation Method
In this thesis, we use MATLAB to simulate the designed protocol suites. MATLAB is
a high-level programming language and interactive environment. It enables
programmers to perform computationally intensive tasks more easily and faster than
with traditional programming languages. The simulation is hosted on 64-bit
Windows 7 running on a Dell Optiplex 760 desktop with an Intel® Core™2 Duo
E8400 processor and 12GB of memory. The prototypes are implemented as MATLAB
applications in the MATLAB R2010a environment.
3.6.1 Assumptions
The following assumptions have been used in the design of the two protocol suites:
the two-party protocol suite, i.e. the P22NSTP, and the protocol suite based on a semi-
trusted third party (STTP), i.e. the P22NSTC.
Design Assumptions for the P22NSTP Protocol Suite
(1) The P22NSTP protocol suite is carried out by two participating parties.
(2) Data perturbation techniques are used in the design of the P22NSTP protocol suite
in order to achieve an efficient solution.
(3) The data perturbation techniques should assist the protection of data privacy while
enabling the computation to be carried out.
(4) A third party is not used in the design of this solution.
(5) Each of the participating parties manages its own security parameter; this
parameter is used to control the number of noise data items to be added into its
dataset.
Design Assumptions for the P22NSTC Protocol Suite
(1) The P22NSTC protocol suite is carried out by two participating parties and an on-
line STTP.
(2) The whole computation is decomposed into a set of local computational tasks;
each of the local computational tasks is performed by either one of the parties or
the STTP.
(3) The two parties cannot communicate with each other directly.
(4) Data perturbation techniques and a cryptographic primitive are used in the design
of the P22NSTC protocol suite in order to achieve a more secure solution.
(5) The data perturbation techniques and the cryptographic primitive should assist the
protection of data privacy while enabling the computation to be carried out.
3.7 Chapter Summary
In this chapter, we have described the design preliminaries and the evaluation method
for the designed privacy-preserving nonparametric sign test solutions, commencing
with the three-layer privacy definitions and the decomposition of the TTP-NST
algorithm. Design requirements have been identified by analysing local/global
computational tasks against the privacy definitions. Finally, the evaluation
methodology has been given; it will be used to evaluate the experimental results from
the MATLAB simulations.
Chapter 4 Privacy-preserving Building Blocks
4.1 Chapter Introduction
This chapter presents an investigation of the privacy-preserving building blocks that
are used in the design of the methods, primitives and protocols presented in this thesis.
These building blocks can be divided into two categories: data perturbation
techniques and a cryptographic primitive. The data perturbation techniques include
data swapping, data randomization and data transformation. The cryptographic
primitive used in this thesis is an additively homomorphic cryptosystem. In addition,
a comparison of the features of these privacy-preserving building blocks is given at
the end of this chapter.
4.2 Data Perturbation Techniques
Data perturbation techniques use simple mathematical or statistical methods to
disguise the original data. As they do not use complex mathematical operations, they
are usually more computationally efficient than cryptographic primitives. However,
they are less secure. Three data perturbation techniques are used in the design of the
solutions presented in this thesis. They are data swapping, data randomization and
data transformation.
4.2.1 Data Swapping
Data swapping involves re-ranking data items in a dataset in a pre-specified order.
With this technique, the value of a data item in the dataset will not be altered; rather,
the order of the data items in the dataset is changed. The match up between a data
subject and its data records can then be disguised. As the values of data items are not
changed, the aggregated properties, e.g. mean value and variance of the dataset are
preserved. Using this technique, the resulting dataset can maintain high data quality
with low information loss. However, in the case of an intruder collecting as many data
records as possible, the aggregated properties, such as mean value and variance, can
still be inferred using the data he/she has collected. In most cases, the sole use of data
swapping is not sufficient for privacy preservation. For a better level of privacy
protection, it should be used in conjunction with other privacy-preserving techniques.
Figure 14 illustrates a data swapping operation.
Figure 14. An example of data swapping operation. (Source: Author’s own)
Figure 14 shows a row-order swap. As shown in the figure, the data subjects in the
original dataset are re-ordered to {4,5,2,1,3} in the swapped dataset. This technique
is particularly effective when it is the order information in a dataset that needs
protecting.
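The swap in Figure 14 can be sketched as follows. This is an illustrative Python fragment (not the thesis's MATLAB prototype) with made-up record values, using the re-ordering {4,5,2,1,3} from the example above:

```python
# A minimal sketch of row-order data swapping.  The swap is a permutation
# of the rows, so it is invertible given the (kept-secret) order.

def swap_rows(dataset, order):
    """Re-rank rows: row j of the output is row order[j]-1 of the input."""
    return [dataset[i - 1] for i in order]

def unswap_rows(swapped, order):
    """Reverse the swap using the secret permutation."""
    restored = [None] * len(swapped)
    for j, i in enumerate(order):
        restored[i - 1] = swapped[j]
    return restored

data = [["S1", 10], ["S2", 20], ["S3", 30], ["S4", 40], ["S5", 50]]
order = [4, 5, 2, 1, 3]
swapped = swap_rows(data, order)
assert swapped[0] == ["S4", 40]                  # subject order is changed ...
assert sorted(r[1] for r in swapped) == [10, 20, 30, 40, 50]  # ... values are not
assert unswap_rows(swapped, order) == data       # and the swap is reversible
```

Because the values themselves are untouched, aggregates such as the mean and variance survive the swap, exactly as the text above notes.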
4.2.2 Data Randomization
Data randomization is among the first proposed data perturbation techniques for
privacy preservation in the literature [WILL’96]. With this technique, the aggregated
properties of the dataset can be preserved while the actual value of a data input or the
match up between a data subject and a data item can be disguised. There are two
approaches to data randomization. The first approach is to add a randomized noise
value to each data item, i.e. the so-called noise-value-addition randomization. This
helps to preserve the privacy of individual data items. One exemplar of this approach
is to add a random value r_i to every data item in a dataset X, where
X = {x_1, ..., x_n} and, for i = 1, ..., n, the r_i are randomly drawn from a chosen
uniform distribution U(α, β) or normal distribution N(μ, σ²). Figure 15 illustrates this
approach.
Figure 15. An example of noise value addition randomisation. (Source: Author’s own)
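A minimal Python sketch of noise-value-addition randomisation, assuming an illustrative noise distribution U(-5, 5) and made-up data values (the thesis itself does not fix these parameters):

```python
# A sketch of noise-value-addition randomisation: each x_i is perturbed by
# a random r_i drawn from a chosen distribution (here U(-5, 5); the bounds
# are illustrative).  Individual values are hidden, while the noise roughly
# cancels in aggregates such as the mean when the dataset is large.

import random

def add_noise(dataset, low=-5.0, high=5.0, seed=42):
    rng = random.Random(seed)          # seeded only for reproducibility
    return [x + rng.uniform(low, high) for x in dataset]

X = [12.0, 7.5, 9.0, 15.5, 11.0]
X_noisy = add_noise(X)
assert all(xn != x for xn, x in zip(X_noisy, X))            # values disguised
assert all(abs(xn - x) <= 5.0 for xn, x in zip(X_noisy, X)) # bounded noise
```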
The second approach is to add a large quantity of randomized data items into a dataset,
i.e. the so-called noise-item-addition randomization. With this approach, the original
values of the data items in the dataset are not changed; rather, the size of the dataset is
increased, and the match-up between a data subject and its data objects is disguised
by the noise data items. This approach is useful for computations in which the
original data values in a dataset cannot be modified. It distorts both the match-up
between a data subject and its data objects and the aggregated information in the
dataset; for example, the mean value and the variance of the dataset are changed. It is
often used in conjunction with the data swapping technique. Figure 16 illustrates an
application of this noise-item-addition randomization approach.
Figure 16. An example of noise addition randomisation. (Source: Author’s own)
As shown in Figure 16(a), the original dataset has four rows of data records,
{S1, S2, S3, S4}. Then four rows of noise data, i.e. {N1, N2, N3, N4}, are added (as
shown in Figure 16(b)). Finally, the data swapping technique is used to swap the order
of the rows, resulting in a randomized and swapped dataset, as shown in Figure 16(c).
If the swap rule and the noise data items are kept secret, the privacy of the original
dataset, e.g. the secrecy of {S1, S2, S3, S4}, its mean value and its variance, can be
kept from an intruder.
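The Figure 16 flow (noise-item addition followed by data swapping) can be sketched as below. The record labels, the noise records and the use of a seeded shuffle as the secret swap rule are all illustrative assumptions:

```python
# A sketch of the Figure 16 flow: noise records are appended to the
# original dataset and the combined rows are then swapped.

import random

def randomise_and_swap(records, noise_records, seed=7):
    rng = random.Random(seed)
    combined = records + noise_records
    order = list(range(len(combined)))
    rng.shuffle(order)                       # the secret swap rule
    return [combined[i] for i in order], order

S = ["S1", "S2", "S3", "S4"]                 # original records
N = ["N1", "N2", "N3", "N4"]                 # noise records
disguised, secret_order = randomise_and_swap(S, N)
assert sorted(disguised) == sorted(S + N)    # values unchanged, size doubled

# Only a holder of the secret order (and the noise list) can recover S:
restored = [None] * len(disguised)
for pos, i in enumerate(secret_order):
    restored[i] = disguised[pos]
assert restored[:4] == S
```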
4.2.3 Data Transformation
Data transformation is a technique that changes the value of a data item or the format
of a dataset based on specific mathematical functions. Depending on a predefined
transformation rule, this transformation can be a one-to-one or one-to-many or many-
to-one mapping transformation. Such techniques include geometric transformation
[DU'01b, DU'01c], linear transformation [DU'01b, DU'01c] and polynomial
transformation [DU'01b, DU'01c]. The geometric transformation technique
transforms an original dataset using an invertible matrix and is based on linear
algebra. The linear transformation is a special case of geometric transformation.
It maps a dataset from one vector space to another vector space. Examples of linear
transformation include data rotation [SHOR’07], data reflection [SHOR’07] and data
projection [SHOR’07]. The geometric transformation and the linear transformation
are both one-to-one mapping transformations. The polynomial transformation
technique transforms a dataset according to a predefined polynomial. By choosing
different polynomials, the transformation can be a one-to-one or many-to-one or one-
to-many mapping transformation.
In a one-to-one mapping transformation, the data transformation is invertible, i.e. the
transformed data items or dataset can be restored to the original data items/dataset as
long as the transforming function is known. For example, if the invertible matrix used
in a geometric transformation is known, the dataset can be restored. One-to-many and
many-to-one mapping transformations are not invertible. For example, given that the
transforming function is

f(x) = 1 if x ≥ 0, and f(x) = -1 if x < 0,

if the original dataset is X = {2, -3, 3, 0, -4}, the transformed dataset will be
X' = {1, -1, 1, 1, -1}. Knowing f(x) is not sufficient to restore X from X'.
Moreover, if f(x) is defined as the vector-valued mapping

f(x) = (1, 0)ᵀ if x ≥ 0, and f(x) = (0, 1)ᵀ if x < 0,

the transformed dataset will be

X' = [ 1 0 1 1 0
       0 1 0 0 1 ],

and the format of the dataset is also changed.
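The two transformation examples can be sketched as follows, assuming the sign-style mapping f1 and the format-changing one-hot mapping f2 with the illustrative dataset X = {2, -3, 3, 0, -4}:

```python
# A sketch of two non-invertible transformations: a many-to-one sign
# mapping f1, and a one-to-many mapping f2 that also changes the format
# of the data (scalar -> 2-component vector).

def f1(x):
    return 1 if x >= 0 else -1

def f2(x):
    return (1, 0) if x >= 0 else (0, 1)

X = [2, -3, 3, 0, -4]
assert [f1(x) for x in X] == [1, -1, 1, 1, -1]
# Knowing f1 is not enough to restore X: many inputs share one image.
assert f1(2) == f1(3) == f1(0) == 1
# f2 changes the dataset's format (each scalar becomes a column vector):
assert [f2(x) for x in X] == [(1, 0), (0, 1), (1, 0), (1, 0), (0, 1)]
```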
4.3 Cryptographic Primitives
A cryptographic primitive is a well-established cryptographic algorithm that can be
used to construct computer security systems. This subsection describes cryptographic
primitives that are used in the design of our solutions.
The following cryptographic terms and definitions are used in the description in this
section and the remaining part of this thesis. Plaintext denotes a message in a readable
form, where a message can be any data that we may want to transfer. Message
encryption is a process of disguising a plaintext message to hide its meaning. An
encrypted message is referred to as ciphertext. Message decryption is a reverse
process of transforming a ciphertext into its original form. Encryption and decryption
are performed through the use of a cryptographic algorithm and a cryptographic key.
A key is typically a large random number, chosen from a pool of possible values
called key space. A cryptosystem is a system performing encryption and decryption
operations. In addition to encryption and decryption algorithms, it also has a range of
algorithms for key generation.
4.3.1 Additively Homomorphic Cryptosystem
Homomorphic Encryption
Homomorphic encryption is a form of encryption in which a specific algebraic
operation performed on ciphertexts is equivalent to an algebraic operation, possibly
a different one, performed on the underlying plaintexts. Homomorphic encryption
schemes are
malleable by design. The homomorphic properties of various cryptosystems have
been applied in a variety of research topics [DAMG’03, RAPP’04], such as secure
voting systems, collision-resistant hash functions, and private information retrieval
schemes. In recent years, they have also been widely used in cloud computing to
ensure the confidentiality of the processed data [MICC’10].
A cryptosystem which supports only one algebraic operation is classified as a partially
homomorphic cryptosystem. Examples of partially homomorphic cryptosystems
include the RSA cryptosystem [MENE'01], the ElGamal cryptosystem [MENE'01],
the Benaloh cryptosystem [BENA'94] and the Paillier cryptosystem [PAIL'99a].
These cryptosystems are described below.
- RSA Cryptosystem [MENE'01]
In the RSA cryptosystem, assume that p and q are two large random and distinct
prime numbers; then n = pq and φ(n) = (p-1)(q-1). Select a random integer e,
1 < e < φ(n), such that gcd(e, φ(n)) = 1. Compute d using the extended Euclidean
algorithm, 1 < d < φ(n), such that ed ≡ 1 (mod φ(n)); d is the multiplicative
inverse of e mod φ(n). The public key consists of n and e, i.e. pk = (n, e), and the
private key is d, i.e. sk = d. The encryption of a message x is given by computing
E_pk(x) = x^e mod n. Its homomorphic property is shown as follows:

E_pk(x1) · E_pk(x2) = (x1^e · x2^e) mod n = (x1 · x2)^e mod n = E_pk(x1 · x2).

The output of performing a multiplication on two ciphertexts is equivalent to the
output of encrypting the result of performing a multiplication on their original
plaintexts. As the RSA cryptosystem has this multiplicative property, it is classified as
a multiplicatively homomorphic cryptosystem.
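The multiplicative property can be checked with a toy RSA instance. The primes p = 11, q = 13 and exponent e = 7 below are illustrative values (real keys are of course far larger):

```python
# A toy demonstration of the RSA multiplicative homomorphism.

p, q = 11, 13
n, phi = p * q, (p - 1) * (q - 1)       # n = 143, phi(n) = 120
e = 7                                   # gcd(7, 120) = 1
d = pow(e, -1, phi)                     # multiplicative inverse of e mod phi(n)
assert (e * d) % phi == 1

def enc(x):
    return pow(x, e, n)

def dec(c):
    return pow(c, d, n)

x1, x2 = 6, 9
# E(x1) * E(x2) decrypts to x1 * x2:
assert dec(enc(x1) * enc(x2) % n) == (x1 * x2) % n
```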
- ElGamal Cryptosystem [MENE'01]
In the ElGamal cryptosystem, assume that p is a large random prime number, α is a
generator of the multiplicative group Z_p* of the integers modulo p, and a is a
random integer with 1 ≤ a ≤ p-2. Then the public key is (p, α, α^a), i.e.
pk = (p, α, α^a), and the private key is a, i.e. sk = a. By randomly selecting an
integer k with 1 ≤ k ≤ p-2, the encryption of a plaintext x is given by computing
γ = α^k mod p and δ = x · (α^a)^k mod p. (γ is a parameter for decryption.) Its
homomorphic property is given as follows:

(γ1 · γ2, δ1 · δ2) = (α^(k1+k2) mod p, (x1 · x2) · (α^a)^(k1+k2) mod p),

which is an encryption of x1 · x2. The output of performing a componentwise
multiplication on two ciphertexts is equivalent to the output of encrypting the result
of performing a multiplication on their original plaintexts. As the ElGamal
cryptosystem has this multiplicative property, it is classified as a multiplicatively
homomorphic cryptosystem.
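A toy ElGamal instance illustrating the componentwise multiplicative property; p = 23, the generator α = 5 and the keys below are illustrative values:

```python
# A toy demonstration of the ElGamal multiplicative homomorphism.

p = 23
alpha = 5                          # a generator of Z_23^*
a = 6                              # private key
beta = pow(alpha, a, p)            # public component alpha^a

def enc(x, k):
    # ciphertext is the pair (gamma, delta)
    return pow(alpha, k, p), x * pow(beta, k, p) % p

def dec(gamma, delta):
    # delta / gamma^a mod p, using gamma^(p-1-a) as the inverse of gamma^a
    return delta * pow(gamma, p - 1 - a, p) % p

g1, d1 = enc(4, k=3)
g2, d2 = enc(5, k=10)
# The componentwise product of ciphertexts encrypts the product 4 * 5:
assert dec(g1 * g2 % p, d1 * d2 % p) == (4 * 5) % p
```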
- Benaloh Cryptosystem [BENA'94]
In the Benaloh cryptosystem, assume that the blocksize is r and that p and q are two
large random prime numbers such that r divides (p-1), gcd(r, (p-1)/r) = 1 and
gcd(r, q-1) = 1. Then n = pq. Select y ∈ Z_n* = {x ∈ Z_n : gcd(x, n) = 1} such that

y^((p-1)(q-1)/r) ≢ 1 mod n.

The public key is (y, n), i.e. pk = (y, n), and the private key is (p, q), i.e.
sk = (p, q). By randomly and uniformly selecting a u with u ∈ Z_n*, the encryption
of a message x is given by computing E_pk(x) = y^x · u^r mod n. Its homomorphic
property is shown as follows:

E_pk(x1) · E_pk(x2) = (y^x1 · u1^r)(y^x2 · u2^r) mod n
                    = y^(x1+x2) · (u1 · u2)^r mod n = E_pk(x1 + x2).

The output of performing a multiplication on two ciphertexts is equivalent to the
output of encrypting the result of performing an addition on their original plaintexts.
As the Benaloh cryptosystem has this additive property, it is classified as an additively
homomorphic cryptosystem.
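A toy check of the Benaloh additive property with illustrative parameters r = 3, p = 7, q = 5; decryption, which requires a small exhaustive search, is omitted here, and the randomisers u are fixed to make the check deterministic:

```python
# A toy instance of the Benaloh scheme's additive property.
# Blocksize r = 3, so messages live in {0, 1, 2}; n = 35, y = 2.

r, p, q = 3, 7, 5
n = p * q
y = 2
# y must not be an r-th power residue: y^((p-1)(q-1)/r) != 1 mod n
assert pow(y, (p - 1) * (q - 1) // r, n) != 1

def enc(x, u):
    return pow(y, x, n) * pow(u, r, n) % n

# E(x1) * E(x2) equals an encryption of x1 + x2 under randomiser u1*u2:
x1, x2, u1, u2 = 1, 1, 2, 3
assert enc(x1, u1) * enc(x2, u2) % n == enc(x1 + x2, u1 * u2 % n)
```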
- Paillier Cryptosystem [PAIL'99a]
In the Paillier cryptosystem, assume that p and q are two large random prime
numbers such that gcd(pq, (p-1)(q-1)) = 1. Compute n = pq and
λ = λ(n) = lcm(p-1, q-1). Select a random integer g with g ∈ Z_{n²}* such that n
divides the order of g. (This can be ensured by checking the existence of the modular
multiplicative inverse μ = (L(g^λ mod n²))^(-1) mod n, where L is defined as
L(u) = (u-1)/n.) The public key consists of n and g, i.e. pk = (n, g), and the
private key consists of λ and μ, i.e. sk = (λ, μ). By randomly selecting an r with
r ∈ Z_n*, the encryption of a message x is given by computing
E_pk(x) = g^x · r^n mod n². Its homomorphic property is shown as follows:

E_pk(x1) · E_pk(x2) = (g^x1 · r1^n)(g^x2 · r2^n) mod n²
                    = g^(x1+x2) · (r1 · r2)^n mod n² = E_pk(x1 + x2).

The output of performing a multiplication on two ciphertexts is equivalent to the
output of encrypting the result of performing an addition on their original plaintexts.
As the Paillier cryptosystem has this additive property, it is classified as an additively
homomorphic cryptosystem.
Assuming that m1 and m2 are two plaintexts and (pk, sk, E_pk(·), D_sk(·)) is a
homomorphic cryptosystem, Figure 17 illustrates the homomorphic property.
Figure 17. The homomorphic cryptosystem. (Source: Author's own)
Each of the examples listed above allows homomorphic computation of only one
operation (either addition or multiplication) on plaintexts. A cryptosystem which
supports both additive and multiplicative homomorphic properties is known as a
fully homomorphic cryptosystem and is far more powerful [GENT'09]. Using such a
scheme, any circuit can be evaluated homomorphically, effectively allowing the
construction of programs. These programs can be run on encryptions of their inputs to
produce an encryption of their output. Since such a program does not need to decrypt
its input, it can be run by an un-trusted party without revealing its inputs and
computation details. The existence of an efficient and fully homomorphic
cryptosystem would have practical implications in the outsourcing of private
computations, for example, in the context of cloud computing [GENT’09].
Additively Homomorphic Encryption
In this thesis, we utilise the additively homomorphic property in the design of one of
our solutions, i.e. the P22NSTC protocol suite. We use the following property:

E_pk(x) · E_pk(y) = E_pk(x + y) and D_sk(E_pk(x + y)) = x + y.

This allows us to conduct an additive operation on encrypted data items without
decrypting them. For example, in a two-party computation with the assistance of a
semi-trusted third party (STTP), Alice and Bob negotiate an additively homomorphic
encryption scheme (pk, sk, E_pk(·), D_sk(·)), where pk is the public key, sk is the
private key, E_pk(·) is the encryption algorithm and D_sk(·) is the decryption
algorithm, prior to the computation. Alice encrypts her private data x and generates
E_pk(x); Bob encrypts his own private data y and generates E_pk(y). Alice and Bob
then send E_pk(x) and E_pk(y) to the STTP, respectively. The STTP can then
calculate E_pk(x) · E_pk(y) = E_pk(x + y) and send E_pk(x + y) back to Alice and
Bob. This property preserves the privacy of x and y, preventing them from being
disclosed to the STTP, while ensuring that the STTP can calculate the desired result
by performing a multiplication operation on the encrypted data items.
Paillier Cryptosystem [PAIL’99a]
As the design of the P22NSTC protocol suite needs the additive property of an
additively homomorphic cryptosystem, and the Paillier cryptosystem is regarded as
the most efficient additively homomorphic cryptosystem of its kind [CATA'01,
AKIN'09], it has been chosen for the implementation of the P22NSTC
protocol suite. Assuming that a party, Alice, constructs a Paillier cryptosystem, the
details of this cryptosystem are described below.
- Key generation
1. To generate a key pair, Alice chooses two large secret prime numbers p and q,
   such that gcd(pq, (p-1)(q-1)) = 1.
2. Alice sets n = pq and computes λ = lcm(p-1, q-1).
3. Alice selects a random integer g, where g ∈ Z_{n²}*.
4. Alice checks the existence of μ = (L(g^λ mod n²))^(-1) mod n (the modular
   multiplicative inverse), where L is defined as L(u) = (u-1)/n, to ensure that n
   divides the order of g.
5. The public key (i.e. pk) is (n, g).
6. The private key (i.e. sk) is (λ, μ).
- Encryption
1. Let m be a message to be encrypted, where m ∈ Z_n.
2. Alice selects a random r, where r ∈ Z_n*.
3. Alice computes the ciphertext of m, i.e. c, as follows:
   c = E_pk(m) = g^m · r^n mod n².
- Decryption
1. The ciphertext c ∈ Z_{n²}*.
2. Alice recovers the plaintext m by computing
   m = D_sk(c) = L(c^λ mod n²) · μ mod n = (L(c^λ mod n²) / L(g^λ mod n²)) mod n.
- Additively homomorphic property
1. Let m1 and m2 be two plaintexts.
2. The additively homomorphic property is:

   E_pk(m1) · E_pk(m2) = (g^m1 · r1^n)(g^m2 · r2^n) mod n²
                       = g^(m1+m2) · (r1 · r2)^n mod n² = E_pk(m1 + m2),

   and

   D_sk(E_pk(m1) · E_pk(m2)) = D_sk(E_pk(m1 + m2)) = m1 + m2.
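The Paillier steps above can be exercised end-to-end on toy parameters. The choice g = n + 1 is a common valid choice but is an assumption here, as are the tiny primes p = 5, q = 7 (real parameters are hundreds of digits long):

```python
# A toy Paillier instance: key generation, encryption, decryption and the
# additive homomorphic property.

from math import gcd

p, q = 5, 7
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)     # lcm(p-1, q-1) = 12
g = n + 1                                        # a standard valid choice

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)              # modular inverse exists

def enc(m, r):
    return pow(g, m, n2) * pow(r, n, n2) % n2

def dec(c):
    return L(pow(c, lam, n2)) * mu % n

m1, m2, r1, r2 = 9, 16, 2, 3                     # gcd(r_i, n) = 1
assert dec(enc(m1, r1)) == m1
# Multiplying ciphertexts adds plaintexts:
assert dec(enc(m1, r1) * enc(m2, r2) % n2) == (m1 + m2) % n
```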
4.4 A Comparison of Privacy-preserving Building Blocks
Figure 18. A comparison of privacy-preserving building blocks. (Source: Author’s own)
Figure 18 compares the main features of the four privacy-preserving building blocks
described in this chapter. The comparison uses the following six criteria:
(1) Does this building block change the values of the data items?
(2) Does this building block change the order of the data items?
(3) Does this building block change the format of the data items?
(4) Is this privacy-preserving operation reversible?
(5) How much computational effort does it require?
(6) What level of security does it provide?
When the data swapping technique is applied to a dataset, it only changes the order of
the data items in the dataset; neither the values nor the format of the data items is
changed. The technique therefore provides only a low level of security protection,
namely through dataset reordering. As the swap operation is performed based on a
pre-specified order, it is reversible: the swap can be seen as the application of a
permutation matrix to the dataset, and every permutation matrix is invertible. As the
data swapping operation only involves one matrix operation, the computational cost
of this technique is low.
The noise-item-addition randomization technique adds a large quantity of noise data
items to a dataset, aiming to hide the original data items among the noise. The
computational effort required and the security level provided by this technique
depend on the number of noise data items generated and added to the original dataset.
If the number of noise data items is small, the computational effort is low; however,
if it is also small relative to the number of data items in the original dataset, the level
of security protection is correspondingly low. A higher level of protection requires
the addition of more noise data items, which imposes more computational effort.
This technique is often employed together with the data swapping technique.
The noise-value-addition randomization technique adds a random noise value to
every data item in the original dataset. The computational effort only involves noise
value generation and noise value addition. The level of security protection depends
on the confidentiality of the noise value generation function: once the function is
known, this operation can easily be reversed.
The data transformation technique can change both the values and the format of data
items, depending on the transformation functions used. If the transformation function
is a one-to-one mapping function, the transformation result is reversible; if it is a
one-to-many or many-to-one function, the transformation result may not be
reversible. The computational effort of the data transformation technique depends on
the transformation operations: if a more complicated function is used, more
computational effort is needed and a better level of protection is provided.
The additively homomorphic cryptosystem changes both the values and the format of
the data items to be protected. It provides the best level of security protection among
these privacy-preserving building blocks. As encryption/decryption operations are
much more complicated than normal algebraic operations, it also consumes the
highest computational effort.
4.5 Chapter Summary
This chapter has presented the privacy-preserving building blocks that form the basis
for the design of our two-party based and STTP based protocol suites. The first
protocol suite, i.e. the P22NSTP protocol suite, using a two-party based approach,
makes use of five novel building blocks that are designed based on the three data
perturbation techniques. The second solution, i.e. the P22NSTC protocol suite, using
the STTP based approach, is built on three novel building blocks that are designed
based on the three data perturbation techniques and a cryptographic primitive. The
designs of these protocol suites will be presented in chapters 5 and 6, respectively.
Chapter 5 A Novel Privacy-preserving Two-party
Nonparametric Sign Test Protocol Suite
Using Data Perturbation Techniques
(P22NSTP)
5.1 Chapter Introduction
This chapter details the design of the P22NSTP protocol suite. Based on the design,
the security threats against the privacy requirements are analysed. The correctness,
the level of privacy provided, the computational overhead and the communication
overhead are also theoretically analysed in this chapter. An overview of the P22NSTP
protocol suite is given in section 5.2. Section 5.3 presents the detailed design,
including the computation participants, message objects, components of the P22NSTP
protocol suite and elements of the computation. Finally, section 5.4 describes the
operation of this protocol suite and discusses its correctness, privacy performance,
computational overhead and communication overhead.
5.2 Overview of the P22NSTP Protocol Suite
This protocol suite is designed for two parties, Alice and Bob, to collaboratively
perform the sign test computation while preserving the individual data confidentiality,
individual privacy and corporate privacy of X and Y . According to the design
requirements specified in chapter 3, this privacy-preserving sign test computation is
divided into four local computation tasks; each local computational task is carried out
by a specific protocol (or specific protocols) of the designed protocol suite. The data
perturbation techniques that have been selected in chapter 4 are used to preserve the
data privacy while supporting the computation of a task to be carried out. The
P22NSTP protocol suite consists of five components:
(1) The Random Probability Density Function Generation Protocol (RpdfGP).
(2) The Data Obscuring protocol (DOP).
(3) The Secure Two-party Comparison Protocol (STCP).
(4) The Data Extraction Protocol (DEP).
(5) The Permutation Reverse Protocol (PRP).
The RpdfGP is employed by the DOP and the STCP as a function to generate a
randomised probability density function (pdf) during the protocol execution. This pdf
is then used by both of these protocols to randomly generate noise data items. The
remaining four protocols each perform a local computational task derived from the
decomposition of the sign test computation on an ideal trusted third party model
(TTP-NST). The detailed operations and their corresponding computational tasks are
described below.
Local Computational Task 1 - DOP
Assume that Alice initiates a P22NSTP computation. Alice first executes the DOP.
The DOP is designed to protect the individual privacy and corporate privacy of X at
the first stage of the P22NSTP computation. The DOP employs the RpdfGP, data
swapping and data randomization techniques to generate a randomised data matrix
X'_{T1T2Md}. The confidentiality of X is then disguised and concealed by
X'_{T1T2Md}, which is sent to Bob and used as one of the data inputs of the STCP.
(Assuming that the size of X is n, and n' is the level of noise item addition managed
by Alice, X'_{T1T2Md} is a matrix of dimension (n+n') × (n+n').)
Local Computational Task 2 - STCP
After receiving X'_{T1T2Md} from Alice, Bob executes the STCP to securely
compare X and Y without learning X. The STCP employs the RpdfGP, data
swapping, data randomization and data transformation techniques to generate (n+n')
disguised matrices, i.e. U^i_{DT3g} for i = 1, ..., (n+n'). (n' is the level of noise item
addition in X'_{T1T2Md}, and it is generated by Alice. Once Bob has received
X'_{T1T2Md}, he knows its dimension, i.e. (n+n') × (n+n'). As Bob knows the value
of n, he can find out the value of n' by computing (n+n') - n.) More specifically, the
STCP protects the individual data confidentiality, individual privacy and corporate
privacy of Y. The corporate privacy of the joint dataset, i.e. the intermediate
computation result {P, Q, R}, is also protected and concealed by these (n+n')
disguised matrices. These (n+n') matrices will be sent to Alice and used as the data
input of the DEP.
Local Computational Task 3 - DEP
After receiving {U^i_{DT3g}}_{i=1}^{n+n'} from Bob, Alice executes the DEP to
reverse the data disguising effects in {U^i_{DT3g}}_{i=1}^{n+n'} that were imposed
by the DOP. In addition, the DEP computes all possible sign test results in relation to
{U^i_{DT3g}}_{i=1}^{n+n'}, i.e. {c^{Id}_i}_{i=1}^{(l+1)(l+2)(l+3)} and
{R^T_i}_{i=1}^{(l+1)(l+2)(l+3)}, where c^{Id}_i is an index of a combination (there
are (l+1)(l+2)(l+3) combinations in total) and R^T_i is the test result in relation to
c^{Id}_i, for i = 1, ..., (l+1)(l+2)(l+3). {c^{Id}_i}_{i=1}^{(l+1)(l+2)(l+3)} and
{R^T_i}_{i=1}^{(l+1)(l+2)(l+3)} will be sent to Bob and used as the data input of
the PRP.
Local Computational Task 4 - PRP
Finally, Bob executes the PRP to single out the final test result, i.e. R^T_final, from
{R^T_i}_{i=1}^{(l+1)(l+2)(l+3)}. Bob then sends the final test result R^T_final to
Alice.
Figure 19 provides an overview of the P22NSTP execution.
Figure 19. An overview of the P22NSTP computation. (Source: Author's own)
5.3 The Design in Detail
It can be seen from Figure 19 that after the execution of each of the local
computational tasks, an intermediate computational result is generated. This
intermediate computation result is later transmitted to the other party along with
further data (detailed in subsection 5.3.1) and used as the input of the next local
computation task. At the last stage of the computation, the final result is generated by
Bob and then sent to Alice. To summarise, four messages are transmitted during the
P22NSTP protocol suite execution. The following subsections present the design
details of the P22NSTP protocol suite.
5.3.1 Computation Participants and Message Objects
Computation Participants:
There are two participants, Alice and Bob. Alice holds dataset X and Bob holds
dataset Y, where X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}, and X and Y are vertically
partitioned.
Message Objects:
The execution of this protocol suite consists of four local computations. Assuming
that Alice initiates the computation, four message objects are transferred during the
computation, which are:
Message 1: Alice sends Bob the output of the DOP, i.e. DOP(X) = X'_{T1T2Md},
together with T1, where T1 is a permutation matrix generated by Alice.
That is, msg1 = {X'_{T1T2Md}, T1}. X'_{T1T2Md} and T1 are both
matrices of dimension (n+n') × (n+n').
Message 2: Bob sends Alice the output of the STCP, i.e.
STCP(X'_{T1T2Md}, Y, T1) = {U^i_{DT2g}}_{i=1}^{n+n'}. That is,
msg2 = {{U^i_{DT2g}}_{i=1}^{n+n'}}. U^i_{DT2g} is a matrix of
dimension (l+3) × (n+n'). There are (n+n') matrices in msg2.
Message 3: Alice sends Bob the output of the DEP, i.e.
DEP({U^i_{DT2g}}_{i=1}^{n+n'}, Tab_{R2}) =
{{c^{Id}_i}_{i=1}^{(l+1)(l+2)(l+3)}, {R^T_i}_{i=1}^{(l+1)(l+2)(l+3)}}.
That is, msg3 = {{c^{Id}_i}_{i=1}^{(l+1)(l+2)(l+3)},
{R^T_i}_{i=1}^{(l+1)(l+2)(l+3)}}. c^{Id}_i is a vector of dimension
1 × 3. R^T_i is an integer value, either "0" or "1", where "0" represents
"do not reject H0" and "1" represents "reject H0". R^T_i is the
computational result in relation to c^{Id}_i. There are (l+1)(l+2)(l+3)
vectors and (l+1)(l+2)(l+3) integer values in msg3.
Message 4: Bob sends Alice the output of the PRP, i.e.
PRP({R^T_i}_{i=1}^{(l+1)(l+2)(l+3)}, {c^{Id}_i}_{i=1}^{(l+1)(l+2)(l+3)},
Tab_{R3}) = R^T_final. That is, msg4 = {R^T_final}. R^T_final is an
integer value. There is only one integer-value data item in msg4.
5.3.2 Components of the P22NSTP Protocol Suite
The five novel components are detailed below.
5.3.2.1 Random Probability Density Function Generation Protocol (RpdfGP)
RpdfGP is designed for a party (Alice or Bob) to randomly generate a Probability Density Function (pdf) based on that party's original dataset. The random pdf is then used to generate noise data items that disguise the original dataset. By adding a large number of noise data items, both the individual privacy and the corporate privacy of the original dataset can be protected. In the case where Alice executes this protocol, it can be detailed as follows.
Random Probability Density Function Generation Protocol (RpdfGP)
Input: $X_{1 \times n} = [x_1, \ldots, x_n]$.
Output: $f(m)$.
(1) Calculate $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
(2) Calculate $\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$.
(3) Calculate $x_{max} = \max_i\{x_i\}$.
(4) Calculate $x_{min} = \min_i\{x_i\}$.
(5) Randomly generate three integers, $k_1$, $k_2$ and $k_3$, from $U(1, 100)$.
(6) Randomly generate $\bar{x}'$ from $U(0, k_1\bar{x})$.
(7) Randomly generate $\sigma_x'^2$ from $U(0, k_2\sigma_x^2)$.
(8) Randomly generate $x'_{min}$ from $U(0, x_{min})$.
(9) Randomly generate $x'_{max}$ from $U(0, k_3x_{max})$.
(10) Randomly select $f(m)$ from the pdf of $N(\bar{x}', \sigma_x'^2)$ or the pdf of $U(x'_{min}, x'_{max})$, each with probability $\frac{1}{2}$.
(11) Output $f(m)$.
Figure 20. The RpdfGP algorithm. (Source: Author's own)
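The steps of Figure 20 can be sketched in Python as follows. This is a minimal illustration only, not the thesis implementation: the function name `rpdfgp` and the choice to return a zero-argument sampler standing in for $f(m)$ are illustrative assumptions, and the dataset is assumed to be a non-empty list of positive numbers.

```python
import random

def rpdfgp(x):
    """Sketch of RpdfGP: derive randomised parameters from the dataset x
    (assumed non-empty, positive values) and return a sampler for f(m)."""
    n = len(x)
    mean = sum(x) / n                                   # step (1): x-bar
    var = sum((v - mean) ** 2 for v in x) / n           # step (2): sigma_x^2
    x_max, x_min = max(x), min(x)                       # steps (3)-(4)
    k1, k2, k3 = (random.randint(1, 100) for _ in range(3))  # step (5)
    mean_r = random.uniform(0, k1 * mean)               # step (6): x-bar'
    var_r = random.uniform(0, k2 * var)                 # step (7): sigma'^2
    min_r = random.uniform(0, x_min)                    # step (8): x'_min
    max_r = random.uniform(0, k3 * x_max)               # step (9): x'_max
    if random.random() < 0.5:                           # step (10): pick a pdf
        return lambda: random.gauss(mean_r, var_r ** 0.5)
    return lambda: random.uniform(min_r, max_r)
```

Each call of the returned sampler draws one noise data item from the randomly selected pdf; the DOP and STCP sketches below would call it repeatedly.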
The rationales of the key computation steps in Figure 20 are explained below.

In step (5), $k_1$, $k_2$ and $k_3$ are security parameters managed by the protocol executor. They are later used in steps (6), (7) and (9) to generate $\bar{x}'$, $\sigma_x'^2$ and $x'_{max}$, respectively. As each of them is randomly drawn from $U(1, 100)$, $\bar{x}'$, $\sigma_x'^2$ and $x'_{max}$ can each be drawn from a wider interval, i.e. an interval between 0 and $k_1\bar{x}$, $k_2\sigma_x^2$ and $k_3x_{max}$, respectively. This increases the randomness of these parameters and of the noise data items that $N(\bar{x}', \sigma_x'^2)$ and $U(x'_{min}, x'_{max})$ generate. The function $U(a, b)$ is a configurable security parameter that is managed by Alice; by controlling $a$ and $b$, a specific level of security can be achieved.

In step (10), randomly selecting $f(m)$ from $N(\bar{x}', \sigma_x'^2)$ and $U(x'_{min}, x'_{max})$ is a further way of increasing the randomness of noise data item generation. By using this method, the random noises from $f(m)$ can be generated either from $N(\bar{x}', \sigma_x'^2)$ or from $U(x'_{min}, x'_{max})$, each with probability $\frac{1}{2}$. (The list of pdf candidates is also managed by the protocol executor. For example, $f(m)$ could be chosen from $f_1, f_2, \ldots, f_n$ with probability $\frac{1}{n}$ each, where $f_1, f_2, \ldots, f_n$ are $n$ different functions.)
How the corporate privacy of $X_{1 \times n}$ is preserved.
The output of RpdfGP is a pdf, and this pdf is either a normal distribution function or a uniform distribution function. It contains only disguised information about the original dataset. This protocol is employed by both Alice and Bob to generate noise data items to conceal $X$ and $Y$, respectively. As the pdf is generated by enlarging the value range of the original dataset, it becomes more difficult to distinguish the original data items from the resulting randomised dataset. For example, after Alice executes this protocol and generates $f(m)$, $f(m)$ is used to add a large number of noise data items to the randomised dataset in the next stage of the computation. Whatever features the original dataset holds, if an intruder wants to learn information about $X$ and has collected as many data items as he/she can, the corporate privacy that can be inferred directly from the randomised dataset is either $X \sim N(\bar{x}', \sigma_x'^2)$ or $X \sim U(x'_{min}, x'_{max})$, where $\bar{x}'$, $\sigma_x'^2$, $x'_{min}$ and $x'_{max}$ have all been randomised. The corporate privacy of $X$, for example its distribution properties, is thus concealed by the random values generated by $f(m)$.
5.3.2.2 Data Obscuring Protocol (DOP)
The data obscuring protocol (DOP) is designed to protect the confidentiality of an original dataset at the first stage of the P22NSTP execution. RpdfGP, data randomization and data swapping techniques are employed in DOP. By using RpdfGP, a managed number of noise data items are generated. These noise data items and the original dataset are merged into a bigger dataset, which is then swapped using a random permutation matrix. The individual privacy and corporate privacy of the original dataset can thereby be preserved. Assuming that Alice executes this protocol, it can be detailed as follows.
Data Obscuring Protocol (DOP)
Input: $X_{1 \times n}$.
Output: $X'_{T_2dT_1M}$.
(1) Execute RpdfGP to generate a random pdf $f_A(m)$.
(2) Generate a random integer $k_4$ from $U(1, 100)$.
(3) Set $n' = k_4 n$.
(4) Use $f_A(m)$ to generate a random vector $X_{aug} = [a_1, \ldots, a_{n'}]$, where $a_i \sim f_A(m)$.
(5) Merge $X$ and $X_{aug}$ to produce $X'$, where $X' = [x_1, \ldots, x_n, a_1, \ldots, a_{n'}]$.
(6) Generate a permutation matrix $T_1$ of dimension $(n+n') \times (n+n')$.
(7) Compute $X'_{T_1} = X'T_1$; assume $X'_{T_1} = [x'_1, \ldots, x'_{n+n'}]$.
(8) Generate a table $Tab_{1C}$ to record the swapping order of each $x'_i$ in relation to $x_j$ and $a_k$, where $i = 1, \ldots, (n+n')$, $j = 1, \ldots, n$ and $k = 1, \ldots, n'$.
(9) Transform $X'_{T_1}$ into a diagonal matrix $X'_{dT_1}$ of dimension $(n+n') \times (n+n')$, i.e.

$X'_{dT_1} = \begin{pmatrix} x'_1 & 0 & \cdots & 0 \\ 0 & x'_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & x'_{n+n'} \end{pmatrix}$.

(10) Generate a random matrix $M$ of dimension $(n+n') \times (n+n')$ by using $f_A(m)$ in the following manner: assuming that the elements of $M$ are $m_{ij}$, where $i = 1, \ldots, (n+n')$ and $j = 1, \ldots, (n+n')$, then $m_{ij} = 0$ for $i = j$; for $i \neq j$, the $m_{ij}$ are values randomly generated from $f_A(m)$.
(11) Generate a randomised dataset matrix $X'_{dT_1M}$ by computing $X'_{dT_1M} = X'_{dT_1} + M$.
(12) Randomly generate a permutation matrix $T_2$ of dimension $(n+n') \times (n+n')$.
(13) Swap the row order of $X'_{dT_1M}$ by computing $X'_{T_2dT_1M} = T_2X'_{dT_1M}$.
(14) Generate a table $Tab_{2R}$ to record the indices of every $x'_i$ in $X'_{T_2dT_1M}$, for $i = 1, \ldots, (n+n')$.
Figure 21. The DOP algorithm. (Source: Author's own)
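Under the same assumptions, the DOP steps can be sketched in Python, with the permutation matrices $T_1$ and $T_2$ represented as index lists rather than explicit $0/1$ matrices. The names `dop`, `noise` and `tab2r` are illustrative, and $k_4$ is fixed rather than drawn from $U(1, 100)$ so that the sketch stays small:

```python
import random

def dop(x, noise, k4=2):
    """Sketch of DOP. x is Alice's data; noise() draws one item from f_A(m).
    Returns (disguised, perm1, tab2r): the (n+n')x(n+n') disguised matrix
    X'_{T2dT1M}, the column permutation T1 (shared with Bob), and Alice's
    private table Tab_2R mapping diagonal position -> current row."""
    n = len(x)
    n_prime = k4 * n                                        # steps (2)-(3)
    x_merged = list(x) + [noise() for _ in range(n_prime)]  # steps (4)-(5)
    m = n + n_prime
    perm1 = random.sample(range(m), m)      # T1: column j holds item perm1[j]
    x_t1 = [x_merged[perm1[j]] for j in range(m)]           # step (7)
    # Steps (9)-(11): x'_j on the diagonal, f_A noise everywhere else.
    mat = [[x_t1[j] if i == j else noise() for j in range(m)]
           for i in range(m)]
    perm2 = random.sample(range(m), m)      # T2: new row r is old row perm2[r]
    disguised = [mat[perm2[r]] for r in range(m)]           # step (13)
    tab2r = {perm2[r]: r for r in range(m)}                 # step (14)
    return disguised, perm1, tab2r
```

In this sketch, `tab2r[j]` gives the row of the disguised matrix on which the diagonal entry $x'_j$ now resides, which is exactly what DEP later needs from $Tab_{2R}$.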
The rationales of key computation steps and the formats of the computational elements in Figure 21 are described below.

In step (2), $k_4$ is a security parameter managed by Alice. It is later used in step (3) to generate $n'$, i.e. the noise addition level of $X'_{T_2dT_1M}$. As $n' = k_4n$ and $k_4$ is randomly drawn from $U(1, 100)$, the number of noise data items to be added is also random and is at least $n$. The larger the $k_4$, the more noise data items are added; consequently, more computational effort is needed to process the disguised dataset. The probability function $U(a, b)$ is a configurable security parameter that is managed by Alice; by controlling $a$ and $b$, a desired level of security can be achieved.

In steps (4) - (5), assuming that $X = [x_1, \ldots, x_n]$ and $X_{aug} = [a_1, \ldots, a_{n'}]$, then $X' = [x_1, \ldots, x_n, a_1, \ldots, a_{n'}]$. $X'$ is the dataset that contains both the original data items and the noise data items; the column order of the data items in $X'$ has not yet been swapped.
In steps (6) - (8), assuming for illustration that the permutation matrix is

$T_1 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}$

(shown as $4 \times 4$ for illustration; in general $T_1$ has dimension $(n+n') \times (n+n')$), then $X'_{T_1} = X'T_1$ rearranges the columns of $X'$ accordingly. $T_1$ swaps the column order of the data items in $X'$, and $Tab_{1C}$ records the change of the column orders caused by $T_1$.
The purpose of using $X_{aug}$ and $T_1$ is to prevent the sample size of $X$ and $Y$ from being learnt by external attackers. In a P22NSTP computation, although the messages are assumed to be transferred via secure channels, there is still a possibility that external attackers might breach the channels and obtain the messages. In such a case, external attackers may hold the same information as Alice and Bob, except the size of $X$ and $Y$. If external attackers do not know the size of the original dataset, it is more difficult for them to restore it. The use of $T_1$ can thus preserve the corporate privacy of $X$ and $Y$ against external attackers throughout the P22NSTP computation. If the possibility of external attackers is not a concern, the use of $X_{aug}$ and $T_1$ may not be necessary.
In steps (9) - (11), the elements of $X'_{T_1}$ are placed on the diagonal of $X'_{dT_1}$, and $X'_{dT_1M} = X'_{dT_1} + M$ then leaves each $x'_i$ in position $(i, i)$ while filling every off-diagonal entry with a noise value drawn from $f_A(m)$.
In steps (12) - (14), assuming that the permutation matrix is

$T_2 = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}$

(shown as $4 \times 4$ for illustration; in general $T_2$ has dimension $(n+n') \times (n+n')$), then $X'_{T_2dT_1M} = T_2X'_{dT_1M}$ rearranges the rows of $X'_{dT_1M}$ accordingly. $T_2$ swaps the row order of the rows in $X'_{dT_1M}$, and $Tab_{2R}$ records the changes of the row orders caused by $T_2$.
How privacy is preserved.
In a vertically partitioned data model, the two parties deal with an identical set of data subjects, the identities of which are implied by the dataset and may already be known to both Alice and Bob. One way to preserve individual privacy in such a data model is to make it as difficult as possible to match a given data item to the identity of its subject. The level of difficulty for Bob to infer the identity of the subject associated with a specific data item managed by Alice depends on the sample size $n$: the larger the sample size, the higher the level of difficulty. As the random noise data items are generated by $f_A(m)$, the original data items of $X$ cannot be distinguished from the noise data items after being mixed with them. Alice can control the level of difficulty by managing the size of $n'$. This protocol adds $(n+n')^2 - n$ noise data items to $X'_{dT_1}$ so as to generate the randomised matrix $X'_{dT_1M}$. In addition, $X'_{dT_1M}$ is further permuted using $T_2$ so as to generate $X'_{T_2dT_1M}$. If a curious Bob wants to infer $X$ from $X'_{T_2dT_1M}$, he has to figure out the real value of $T_2$. The probability for Bob to successfully guess $T_2$ is $\frac{1}{(n+n')!}$, which is negligible when $n'$ is sufficiently large in relation to $n$.
5.3.2.3 Secure Two-party Comparison Protocol (STCP)
The secure two-party comparison protocol (STCP) is designed to allow one of the two parties to compare the two datasets without learning either the exact values of the other party's dataset or the comparison results. In addition to RpdfGP, data randomization and data swapping, a data transformation technique is used in this protocol to preserve the sign of each pairwise comparison. STCP is executed by Bob after Alice executes DOP. It compares his private dataset $Y$ with Alice's private dataset $X$, while preserving the individual privacy of $Y$ and the corporate privacy of the comparison results, i.e. $\{P, Q, R\}$. The detailed description of this protocol is given below.
Secure Two-party Comparison Protocol (STCP)
Input: $X'_{T_2dT_1M}$ and $T_1$ (from Alice); $Y_{1 \times n}$ (from Bob).
Output: $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$.
(1) Execute RpdfGP to generate a random pdf $f_B(e)$.
(2) Use $f_B(e)$ to generate a random vector $Y_{aug} = [e_1, \ldots, e_{n'}]$, where $e_i \sim f_B(e)$.
(3) Merge $Y$ with $Y_{aug}$ to generate $Y' = [y_1, \ldots, y_n, e_1, \ldots, e_{n'}]$.
(4) Compute $Y'_{T_1} = Y'T_1$. For the sake of clarity, we assume $Y'_{T_1} = [y'_1, \ldots, y'_{n+n'}]$.
(5) Generate a matrix $Y'_{Y'T_1}$ of dimension $(n+n') \times (n+n')$, every row of which is $Y'_{T_1}$:

$Y'_{Y'T_1} = \begin{pmatrix} y'_1 & y'_2 & \cdots & y'_{n+n'} \\ y'_1 & y'_2 & \cdots & y'_{n+n'} \\ \vdots & & & \vdots \\ y'_1 & y'_2 & \cdots & y'_{n+n'} \end{pmatrix}$.

(6) Compute $D = X'_{T_2dT_1M} - Y'_{Y'T_1}$, where $D$ is the comparison result matrix. We assume

$D = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1(n+n')} \\ d_{21} & d_{22} & \cdots & d_{2(n+n')} \\ \vdots & & & \vdots \\ d_{(n+n')1} & d_{(n+n')2} & \cdots & d_{(n+n')(n+n')} \end{pmatrix}$

and the row vectors of $D$ are $D_1 = [d_{11}, \ldots, d_{1(n+n')}]$, ..., $D_{n+n'} = [d_{(n+n')1}, \ldots, d_{(n+n')(n+n')}]$. Here $D$ indicates the result of subtracting $Y'_{Y'T_1}$ from $X'_{T_2dT_1M}$; $D_i$ indicates the result of subtracting $Y'_{T_1}$ from the $i$th row of $X'_{T_2dT_1M}$, for $i = 1, \ldots, (n+n')$.
(7) Generate a function $g(d)$ such that

$g(d) = \begin{cases} (1, 0, 0)^T, & d > 0, \\ (0, 1, 0)^T, & d = 0, \\ (0, 0, 1)^T, & d < 0. \end{cases}$

(8) Compute $g(D_i) = [g(d_{i1}), \ldots, g(d_{i(n+n')})]$, for $i = 1, \ldots, (n+n')$. Here $g(D_i)$ is the transformed comparison result of $D_i$ and it is a $3 \times (n+n')$ data matrix.
(9) Generate a random integer $k_5$ from $U(1, 100)$.
(10) Set $l = 3k_5n + 3$.
(11) To disguise $\{g(D_i)\}_{i=1}^{n+n'}$, generate random matrices $U_i$ of dimension $l \times (n+n')$, for $i = 1, \ldots, (n+n')$, where the elements of $U_i$ are randomly drawn from $\{0, 1\}$. These random matrices add random noise to every $g(D_i)$.
(12) For $i = 1, \ldots, (n+n')$, generate $U_{gD_i}$ by merging $g(D_i)$ with $U_i$ in such a way that $g(D_i)$ is the upper part of $U_{gD_i}$ and $U_i$ is the lower part, where the dimension of $U_{gD_i}$ is $(l+3) \times (n+n')$. That is, the first three rows of each $U_{gD_i}$ contain the obscured comparison results, i.e. $g(D_i)$, with the remaining $l$ rows being the random noise, i.e. $U_i$.
(13) Generate a random permutation matrix $T_3$ of dimension $(l+3) \times (l+3)$.
(14) Compute $U_{T_3gD_i} = T_3U_{gD_i}$, for $i = 1, \ldots, (n+n')$.
(15) Generate a table $Tab_{3R}$ to record the swapped order of $g(D_i)$ in relation to $U_{T_3gD_i}$, for $i = 1, \ldots, (n+n')$.
Figure 22. The STCP algorithm. (Source: Author's own)
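Continuing the sketch from DOP (illustrative names; $k_5$ is fixed for brevity and $T_3$ is again an index list), the core of STCP — noise augmentation, the subtraction of step (6), the transformation $g(d)$, and the stacking and row permutation of steps (11)-(14) — can be written as:

```python
import random

def g(d):
    """Step (7): map a difference to a one-hot column vector (P, Q, R)."""
    return [1, 0, 0] if d > 0 else ([0, 1, 0] if d == 0 else [0, 0, 1])

def stcp(disguised, perm1, y, noise, k5=1):
    """Sketch of STCP. disguised and perm1 come from Alice's DOP; y is Bob's
    data; noise() draws from f_B(e). Returns ({U_T3gDi}, perm3), where perm3
    plays the role of T3/Tab_3R: output row r is stacked row perm3[r]."""
    m, n = len(disguised), len(y)
    y_merged = list(y) + [noise() for _ in range(m - n)]    # steps (2)-(3)
    y_t1 = [y_merged[perm1[j]] for j in range(m)]           # step (4)
    l = 3 * k5 * n + 3                                      # step (10)
    perm3 = random.sample(range(l + 3), l + 3)              # step (13)
    out = []
    for i in range(m):
        cols = [g(disguised[i][j] - y_t1[j]) for j in range(m)]    # (6), (8)
        g_di = [[cols[j][r] for j in range(m)] for r in range(3)]  # 3 x m
        u_i = [[random.randint(0, 1) for _ in range(m)] for _ in range(l)]
        stacked = g_di + u_i                                # step (12)
        out.append([stacked[perm3[r]] for r in range(l + 3)])  # step (14)
    return out, perm3
```

The returned list of $(l+3) \times (n+n')$ matrices is what Bob would transmit as $msg_2$, while `perm3` stays on Bob's side.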
The rationales of key computation steps and the formats of the computational elements in Figure 22 are further explained below.

In steps (2) - (4), assuming that $Y = [y_1, \ldots, y_n]$ and $Y_{aug} = [e_1, \ldots, e_{n'}]$, then $Y' = [y_1, \ldots, y_n, e_1, \ldots, e_{n'}]$, and $Y'_{T_1} = Y'T_1$ places these data items into the same column order that $T_1$ imposed on $X'$. (In the original figures, green shading indicates the locations where the original $y_i$ are located, for $i = 1, \ldots, n$.)

In steps (5) and (6), every row of $Y'_{Y'T_1}$ is $Y'_{T_1}$, so each row vector $D_i$ of $D$ is the result of subtracting $Y'_{T_1}$ from the $i$th row of $X'_{T_2dT_1M}$. The column locations of $D_i$ where the original differences $(x_i - y_i)$ are located, for $i = 1, \ldots, n$, have been disguised.

In step (8), after the data transformation each element $g^{ij}_k$ of $g(D_i)$ is either 1 or 0, for $i = 1, \ldots, (n+n')$, $j = 1, \ldots, (n+n')$ and $k = 1, 2, 3$.

In step (9), $k_5$ is a security parameter managed by Bob. It is later used in step (10) to generate $l$, i.e. the number of noise rows added to each $U_{T_3gD_i}$. As $l = 3k_5n + 3$ and $k_5$ is randomly drawn from $U(1, 100)$, the number of noise rows added to every $U_{T_3gD_i}$ is random and at least $3n + 3$. $l$ is also divisible by 3 (for explanation, here we further let $l = 3k'_5$); this ensures that the row vectors of every $U_{T_3gD_i}$ can be separated into $k'_5$ sets of $\{P, Q, R\}$ combinations. The larger the $k_5$, the more noise rows are added; consequently, more computational effort is needed to process the disguised dataset. The probability function $U(a, b)$ is a configurable security parameter that is managed by Bob; a desired level of security can be achieved by controlling $a$ and $b$.
In step (11), each $U_i$ has the form

$U_i = \begin{pmatrix} u^i_{11} & u^i_{12} & \cdots & u^i_{1(n+n')} \\ u^i_{21} & u^i_{22} & \cdots & u^i_{2(n+n')} \\ \vdots & & & \vdots \\ u^i_{l1} & u^i_{l2} & \cdots & u^i_{l(n+n')} \end{pmatrix}_{l \times (n+n')}$,

for $i = 1, \ldots, (n+n')$, where every element $u^i_{jk}$ is randomly drawn from $\{0, 1\}$.
In step (12), stacking $g(D_i)$ on top of $U_i$ gives

$U_{gD_i} = \begin{pmatrix} g^i_{11} & g^i_{12} & \cdots & g^i_{1(n+n')} \\ g^i_{21} & g^i_{22} & \cdots & g^i_{2(n+n')} \\ g^i_{31} & g^i_{32} & \cdots & g^i_{3(n+n')} \\ u^i_{11} & u^i_{12} & \cdots & u^i_{1(n+n')} \\ \vdots & & & \vdots \\ u^i_{l1} & u^i_{l2} & \cdots & u^i_{l(n+n')} \end{pmatrix}_{(l+3) \times (n+n')}$,

for $i = 1, \ldots, (n+n')$.
In step (13), for illustration we let

$T_3 = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$

(shown as $4 \times 4$ for illustration; in general $T_3$ has dimension $(l+3) \times (l+3)$).
Then, in step (14), $U_{T_3gD_i} = T_3U_{gD_i}$ permutes the rows of each $U_{gD_i}$, scattering the three rows of $g(D_i)$ among the $l$ noise rows, for $i = 1, \ldots, (n+n')$.

Then, in step (15), $Tab_{3R}$ records the change of row orders in every $U_{T_3gD_i}$, i.e. the positions at which the rows of $g(D_i)$ now reside, for $i = 1, \ldots, (n+n')$.
How privacy is preserved.
Upon receipt of $X'_{T_2dT_1M}$ and $T_1$ from Alice, Bob can compute $X''_{dMT_2} = X'_{T_2dT_1M}T^T_1$ ($T^T_1$ is the transpose of $T_1$) to reverse $T_1$'s swap effect in $X'_{T_2dT_1M}$. He can then try to infer the real values of the $x_i$s from $X''_{dMT_2}$. The probability for Bob to successfully infer $X$ is $\frac{1}{(n+n')(n+n'-1)\cdots(n'+1)}$, which approximates $\frac{1}{(n+n')^n}$ when $n' \gg n$. This probability is negligible when $n'$ is sufficiently large. The probability for Bob to successfully infer the intermediate result $\{P, Q, R\}$ is the same as the probability of inferring $X$. As Alice manages the security parameter $n'$, the larger the value of $n'$, the more difficult it is for Bob to infer $X$ and $\{P, Q, R\}$.
How the random matrices $U_1, \ldots, U_{n+n'}$ are chosen.
In this protocol, the $U_i$s are used to conceal the values of the intermediate computation results $\{P, Q, R\}$ in the $U_{gD_i}$s. As $P + Q + R = n$, these $U_i$s should be chosen in such a way that the row sum of any three rows in each $U_{gD_i}$ is equal to $(n+n')$; otherwise the level of security would be lowered. This is because all of these data matrices, $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$, will be sent to Alice at the next stage of the computation. If some row combinations with row sums not equal to $(n+n')$ existed, the chance for Alice to infer $\{P, Q, R\}$ from these matrices would be higher. (According to the nature of the sign test, the value of $P + Q + R$ should be equal to $n$; a case where $P + Q + R \neq n$ is not valid for sign test computation.)
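One way to satisfy this row-sum constraint, sketched here as an assumption about how the $U_i$s could be built (the helper name `noise_rows` is illustrative), is to emit the noise rows in triples in which every column is one-hot, so that each noise triple has the same total row sum, $(n+n')$, as the true $\{P, Q, R\}$ rows:

```python
import random

def noise_rows(m, k_triples):
    """Generate k_triples triples of {0,1} noise rows of width m such that,
    within each triple, every column holds exactly one 1. The three row sums
    of any such triple therefore total m, matching the g(D_i) rows."""
    rows = []
    for _ in range(k_triples):
        triple = [[0] * m for _ in range(3)]
        for col in range(m):
            triple[random.randrange(3)][col] = 1   # one-hot column
        rows.extend(triple)
    return rows
```

With noise built this way, every candidate triple that Alice might extract from a $U_{gD_i}$ looks like a plausible $\{P, Q, R\}$ decomposition of $(n+n')$.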
5.3.2.4 Data Extraction Protocol (DEP)
The data extraction protocol (DEP) is designed to reverse the data obscuring operations performed on $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$ by DOP. It further computes the sign test results on $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$. Assuming that Alice executes this protocol, the protocol details can be described as follows.
Data Extraction Protocol (DEP)
Input: $Tab_{2R}$ (from Alice); $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$ (from Bob).
Output: $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$.
(1) Use $Tab_{2R}$ to extract $V^{x'}_1, \ldots, V^{x'}_{n+n'}$ from $U_{T_3gD_1}, \ldots, U_{T_3gD_{n+n'}}$. Here $V^{x'}_1, \ldots, V^{x'}_{n+n'}$ are the column vectors where the $x'_i$s are stored.
(2) Compute a $(l+3) \times (n+n')$ data matrix $U_d$ by merging $V^{x'}_1, \ldots, V^{x'}_{n+n'}$, where $V^{x'}_i$ is the $i$th column vector of $U_d$, for $i = 1, \ldots, (n+n')$.
(3) Compute $U_{T_1d} = U_dT^T_1$.
(4) Divide $U_{T_1d}$ into two parts, i.e. $U_{T_1d} = [U^x_{T_1d} \mid U^a_{T_1d}]$, where $U^x_{T_1d}$ is a data matrix of dimension $(l+3) \times n$ and $U^a_{T_1d}$ is a data matrix of dimension $(l+3) \times n'$.
(5) Calculate the column vector of the row sums of $U^x_{T_1d}$ by computing

$U^{x\,rs}_{T_1d} = \begin{pmatrix} u^{rs}_1 \\ u^{rs}_2 \\ \vdots \\ u^{rs}_{l+3} \end{pmatrix}$, where $u^{rs}_i$ is the $i$th row sum of $U^x_{T_1d}$, for $i = 1, \ldots, (l+3)$.

(6) Generate all $(l+3)(l+2)(l+1)$ ordered combinations of $\{P, Q, R\}$ from $\{u^{rs}_i\}_{i=1}^{l+3}$. We further assume that $C_1, \ldots, C_{(l+3)(l+2)(l+1)}$ represent these combinations and $c^{I_d}_1, \ldots, c^{I_d}_{(l+3)(l+2)(l+1)}$ represent the indices of these combinations.
(7) Perform the sign test using $\{C_i\}_{i=1}^{(l+3)(l+2)(l+1)}$, along with $n$, $\alpha$, $z$ and the standard normal distribution table, to generate all disguised test results in relation to $\{C_i\}_{i=1}^{(l+3)(l+2)(l+1)}$. Assuming that $R^T_i$ is the disguised sign test result for $C_i$, $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ is the set of all disguised test results in relation to $\{C_i\}_{i=1}^{(l+3)(l+2)(l+1)}$.
(8) Output $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ and $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$.
Figure 23. The DEP algorithm. (Source: Author's own)
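The extraction and exhaustive sign testing of Figure 23 can be sketched as follows (illustrative names, continuing the DOP/STCP sketches). The normal-approximation sign test shown here, with ties discarded and a fixed two-sided critical value of 1.96 for $\alpha = 0.05$, is one common formulation and is an assumption about the exact test statistic used in the thesis:

```python
import itertools
import math

def sign_test(p, q, r):
    """Normal-approximation sign test sketch: p, q, r count positive, tied
    and negative differences. Ties (q) are discarded; return 1 ('reject H0')
    when the standardised count of positive differences is extreme."""
    m = p + r                                  # effective sample size
    if m == 0:
        return 0
    z = (p - m / 2.0) / math.sqrt(m / 4.0)
    return 1 if abs(z) >= 1.96 else 0          # two-sided, alpha = 0.05

def dep(u_mats, tab2r, perm1, n):
    """Sketch of DEP: rebuild U_d from Alice's Tab_2R, undo T1, row-sum the
    n original-data columns, then sign-test every ordered row triple."""
    m = len(u_mats)
    height = len(u_mats[0])                    # l + 3 rows
    # Steps (1)-(2): column j of U_d comes from the matrix for diag position j.
    u_d = [[u_mats[tab2r[j]][r][j] for j in range(m)] for r in range(height)]
    # Steps (3)-(4): invert T1 and keep the first n (original-data) columns.
    inv1 = {perm1[j]: j for j in range(m)}
    u_x = [[row[inv1[orig]] for orig in range(n)] for row in u_d]
    row_sums = [sum(row) for row in u_x]       # step (5)
    # Steps (6)-(7): every ordered triple of rows is a {P, Q, R} candidate.
    return {idx: sign_test(*(row_sums[i] for i in idx))
            for idx in itertools.permutations(range(height), 3)}
```

The returned dictionary plays the role of $\{c^{I_d}_i\}$ and $\{R^T_i\}$ together: each key is a candidate row triple (an index $c^{I_d}_i$) and each value is the corresponding disguised result $R^T_i$.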
The rationales of key computation steps and the formats of the computational elements in Figure 23 are further explained below.

In steps (1) - (2), according to $Tab_{2R}$, each column of $U_d$ is taken from the matrix $U_{T_3gD_j}$ that holds the corresponding $x'_i$: for example, the first column vector of $U_d$ may be the first column vector of $U_{T_3gD_{n+n'}}$, the second column vector of $U_d$ the second column vector of $U_{T_3gD_3}$, the third column vector of $U_d$ the third column vector of $U_{T_3gD_2}$, and so on, up to the $(n+n')$th column vector of $U_d$ being the $(n+n')$th column vector of $U_{T_3gD_1}$. $U_d$ is therefore a $(l+3) \times (n+n')$ matrix whose $i$th column vector is $V^{x'}_i$.

In step (3), $U_{T_1d} = U_dT^T_1$ reverses the column permutation that $T_1$ applied in DOP, so that the first $n$ columns of $U_{T_1d}$ correspond to the original data items $x_1, \ldots, x_n$ and the last $n'$ columns correspond to the noise items $a_1, \ldots, a_{n'}$.
In step (4), writing the elements of $U_{T_1d}$ as $u_{ij}$, for $i = 1, \ldots, (l+3)$ and $j = 1, \ldots, (n+n')$,

$U^x_{T_1d} = (u_{ij})$ with $j = 1, \ldots, n$, of dimension $(l+3) \times n$, and $U^a_{T_1d} = (u_{ij})$ with $j = (n+1), \ldots, (n+n')$, of dimension $(l+3) \times n'$.
In step (5),

$U^{x\,rs}_{T_1d} = \begin{pmatrix} \sum_{j=1}^{n} u_{1j} \\ \sum_{j=1}^{n} u_{2j} \\ \vdots \\ \sum_{j=1}^{n} u_{(l+3)j} \end{pmatrix} = \begin{pmatrix} u^{rs}_1 \\ u^{rs}_2 \\ \vdots \\ u^{rs}_{l+3} \end{pmatrix}$,

where each row sum $u^{rs}_i$ is a possible value for $P$, $Q$ or $R$, for $i = 1, \ldots, (l+3)$.
In steps (6) - (8), the relationship among $C_i$, $c^{I_d}_i$ and $R^T_i$ is described in Figure 24.
Figure 24. The detailed relationship among $C_i$, $c^{I_d}_i$ and $R^T_i$. (Source: Author's own)
How privacy is preserved.
This protocol prevents Alice from knowing $\{P, Q, R\}$ and $Y$. The probability for Alice to successfully guess $\{P, Q, R\}$ is $\frac{1}{(l+3)(l+2)(l+1)}$, which converges to $\frac{1}{l^3}$. Bob can control the security level by increasing or decreasing the value of $l$. The real values of the differences between $x_i$ and $y_i$ have been transformed to $(P_i, Q_i, R_i)$, where $P_i, Q_i, R_i \in \{0, 1\}$, for $i = 1, \ldots, (n+n')$. Even in the event that Alice successfully guesses $\{P, Q, R\}$, she may still be unable to work out $Y$. To compromise the individual privacy of $Y$, Alice needs to work out whether the data pairs in the joint dataset $(X, Y)$ are equal or not, i.e. Alice has to find the value of $Q$. The probability for Alice to successfully guess these identical pairs is $\frac{Q!\,(n-Q)!}{(l+3-n)!}$. For those pairs with $x_i > y_i$ or $x_i < y_i$, the only information Alice can work out is the relative relation within each $(x_i, y_i)$, rather than the actual value of $y_i$.
5.3.2.5 Permutation Reverse Protocol (PRP)
The permutation reverse protocol (PRP) is designed to reverse the data obscuring operations performed by STCP and to single out the final sign test result. As we assume STCP is executed by Bob, PRP is also executed by Bob, to reverse the permutation effect of $T_3$ in $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$. This protocol can be detailed as follows.
Permutation Reverse Protocol (PRP)
Input: $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ and $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ (from Alice); $Tab_{3R}$ (from Bob).
Output: The final sign test result $TR_{final}$.
(1) Find $c^{I_d}_F$ from $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ using $Tab_{3R}$, where $c^{I_d}_F$ is the index of the actual $\{P, Q, R\}$.
(2) Find $TR_{final}$ from $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ using $c^{I_d}_F$.
Figure 25. The PRP algorithm. (Source: Author's own)
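Carrying the sketch through, PRP is a pure lookup on Bob's side. With $T_3$ kept as the index list `perm3` from the STCP sketch (output row $r$ was stacked row `perm3[r]`), the true $\{P, Q, R\}$ rows are the permuted positions of stacked rows 0, 1 and 2:

```python
def prp(results, perm3):
    """Sketch of PRP: results maps ordered row triples to disguised sign
    test outcomes (from the DEP sketch); perm3 is Bob's private T3 / Tab_3R.
    Returns TR_final, the outcome for the triple of true P, Q, R rows."""
    inv = {perm3[r]: r for r in range(len(perm3))}   # stacked row -> position
    return results[(inv[0], inv[1], inv[2])]
```

Because only Bob holds `perm3`, only he can pick out the correct entry, which is exactly the privacy argument made for this protocol.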
In this final protocol, Bob uses $Tab_{3R}$ to reverse the permutation operation on $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$ that was performed by STCP, thus identifying the final test result from $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$. Each $R^T_i$ represents the test result for a corresponding possible combination of $\{P, Q, R\}$, and $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ are the indices for $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$. By referring to $Tab_{3R}$ and $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$, $c^{I_d}_F$ can be identified, and thus $TR_{final}$ can be singled out from $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$. The computation result of each $R^T_i$ is an integer value, where $R^T_i = 1$ represents "reject $H_0$" and $R^T_i = 0$ represents "do not reject $H_0$". As Bob does not know any information about how Alice produces this index, the only way he can infer $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ is by guessing; the probability for Bob to infer $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ is $\frac{1}{(l+3)(l+2)(l+1)}$. At this point, it is virtually impossible for Bob to infer any information about either $X$ or $\{P, Q, R\}$. In the case where Bob correctly identifies the final test result but intentionally sends a wrong result to Alice, although the computation has been compromised, the privacy of $X$ and $\{P, Q, R\}$ is still preserved.
5.4 The P22NSTP Protocol Suite and Its Operation
On the basis of the components described above, we can now construct the P22NSTP protocol suite. Assuming that Alice initiates the computation, the protocol is detailed in the following subsection.
5.4.1 Operation of the P22NSTP Protocol Suite
The P22NSTP Protocol Suite
Input: (1) $X$ from Alice.
(2) $Y$ from Bob.
(3) $\alpha$ and $z$, negotiated and determined by both Alice and Bob.
Output: Both Alice and Bob obtain the final computation result.
(1) Alice executes DOP. (Input: $X$; Output: $X'_{T_2dT_1M}$.)
(2) Alice sends $X'_{T_2dT_1M}$ and $T_1$ to Bob.
(3) Bob executes STCP. (Input: $X'_{T_2dT_1M}$, $Y$ and $T_1$; Output: $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$.)
(4) Bob sends $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$ to Alice.
(5) Alice executes DEP. (Input: $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$ and $Tab_{2R}$; Output: $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ and $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$.)
(6) Alice sends $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ and $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ to Bob.
(7) Bob executes PRP. (Input: $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$, $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ and $Tab_{3R}$; Output: $TR_{final}$.)
(8) Bob sends $TR_{final}$ to Alice.
Figure 26. The P22NSTP protocol suite operation.
(Source: Author’s own)
5.4.2 Correctness
After receiving $\{U_{T_3gD_1}, \ldots, U_{T_3gD_{n+n'}}\}$ from Bob, Alice starts to perform the reverse procedure. $U_{T_3gD_j}$ is the disguised comparison result of the $j$th row of $X'_{T_2dT_1M}$ against the $j$th row of $Y'_{Y'T_1}$, and contains genuine information only in relation to a specific $(x'_i - y'_i)$, for $i = 1, \ldots, (n+n')$. As $Tab_{2R}$ records each $x'_i$'s position after the swapping, by referring to $Tab_{2R}$ the column vector where $x'_i$'s information is stored can be extracted from $U_{T_3gD_j}$, i.e. $\{V^{x'}_1, \ldots, V^{x'}_{n+n'}\}$ can be singled out from $\{U_{T_3gD_1}, \ldots, U_{T_3gD_{n+n'}}\}$. Assuming that $U_d$ is the $(l+3) \times (n+n')$ matrix whose $i$th column vector is $V^{x'}_i$, $U_d$ is the disguised comparison result of $X - Y$. In addition to the information of $X - Y$, $(l+3)(n+n') - n$ noise data items have been added, and both the row order and the column order have been distorted.

According to DOP and STCP, the distortion of the column order in $U_d$ is contributed by $T_1$. By referring to $Tab_{1C}$, the pure noise matrix can be singled out from $U_d$: namely, $U_{T_1d}$ can be separated into $U^x_{T_1d}$ and $U^a_{T_1d}$, where $U^a_{T_1d}$ is the pure noise matrix. $U^x_{T_1d}$ is a matrix of dimension $(l+3) \times n$. In $U^x_{T_1d}$, three of the $(l+3)$ row vectors are the $P$, $Q$ and $R$ for the sign test computation; the remaining $l$ rows are noise. The row order of $U^x_{T_1d}$ has already been swapped by $T_3$. As Alice does not know the exact positions of the $P$, $Q$ and $R$ rows, she performs sign tests for all $(l+3)(l+2)(l+1)$ ordered row combinations. The indices of the sign test results in relation to their corresponding rows are also recorded, i.e. $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ and $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$.

After receiving $\{R^T_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ and $\{c^{I_d}_i\}_{i=1}^{(l+3)(l+2)(l+1)}$ from Alice, Bob can reverse the row swap in $U^x_{T_1d}$ by referring to $Tab_{3R}$. The correct $P$, $Q$ and $R$ rows can then be singled out, and hence the correct final test result can be identified.
5.4.3 Protocol Analysis
By invoking the building blocks described in section 5.3, we can see that:
(1) $X$ is concealed in $X'_{T_2dT_1M}$ by DOP: the probability for Bob to successfully infer $X$ from $X'_{T_2dT_1M}$ is approximately $\frac{1}{(n+n')^n}$. This probability is controlled by the security parameter $n'$, which is managed by Alice.
(2) At no point does Alice have the opportunity to access the data items of $Y$ directly.
(3) $\{P, Q, R\}$ is concealed in $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$ by STCP: the probability for Alice to successfully guess $\{P, Q, R\}$ is approximately $\frac{1}{l^3}$. This probability is controlled by the security parameter $l$, which is managed by Bob.
(4) DEP and PRP are performed by Alice and Bob, respectively, so as to remove the data disguising effects introduced by STCP and DOP.
5.4.3.1 Privacy Analysis against Privacy Requirements
This section analyses the level of privacy provided by P22NSTP using the information entropy method provided by [AGRA'01]. The analysis can be divided into two aspects:
(1) The privacy protection provided by DOP: after receiving $X'_{T_2dT_1M}$ and $T_1$ from Alice, what difficulty does Bob have to overcome in order to infer $X$?
(2) The privacy protection provided by STCP: after receiving $\{U_{T_3gD_i}\}_{i=1}^{n+n'}$ from Bob, what difficulty does Alice have to overcome in order to infer the correct intermediate results $P$, $Q$ and $R$?
5.4.3.2 Quantify Privacy Level Using Entropy
Privacy Protection Provided by DOP
Alice adds $(n+n')^2 - n$ noise data items into the disguised data matrix $X'_{T_2dT_1M}$, where $n'$ is a security parameter managed by Alice. She then sends $X'_{T_2dT_1M}$ and $T_1$ to Bob. In the design of the P22NSTP computation, Alice also needs to send the permutation matrix $T_1$ to Bob. As it is assumed that Alice and Bob both know the details of the P22NSTP protocol suite, Bob can remove the effect of $T_1$ by multiplying by $T^T_1$. Thus only $(n+n') - n = n'$ effective noise data items are left in the disguised data matrix, and the data compromising task for Bob is reduced to identifying the $n$ real data inputs among the $(n+n')$ inputs.

According to Shannon's entropy definition for discrete cases, we can calculate the entropy of $X'_{T_2dT_1M}$ as:

$h(X'_{T_2dT_1M}) = -\sum_{i=1}^{n+n'} p_{(N,N')}(x_i)\log_2(p_{(N,N')}(x_i))$,

where $N$ is the size of $X$, $N'$ is the security parameter managed by Alice, and $p_{(N,N')}(x_i)$ is the probability for Bob to guess $x_i$. Assuming that $x_1, x_2, \ldots, x_n$ are the original data inputs and $x_{n+1}, x_{n+2}, \ldots, x_{n+n'}$ are the noise data items, the probability for Bob to infer $x_i$ is

$p_{(N,N')}(x_i) = \frac{1}{n+n'-i+1}$, for $i = 1, 2, \ldots, n$;

and, for $i = (n+1), (n+2), \ldots, (n+n')$,

$p_{(N,N')}(x_i) = \frac{1 - \sum_{i=1}^{n} p_{(N,N')}(x_i)}{n'} = \frac{1 - \sum_{i=1}^{n} \frac{1}{n+n'-i+1}}{n'}$,

such that $\sum_{i=1}^{n+n'} p_{(N,N')}(x_i) = 1$.
The analysis is performed by assuming sample sizes $N \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000, 10000, 100000\}$ and numbers of noise data items $N' \in \{N, 2N, 3N, 4N, 5N, 6N, 7N, 8N, 9N, 10N\}$. Table 8 shows the entropy values in this setting.
Table 8. The entropy values. (Source: Author's own)
From Table 8, it can be seen that when the sample size $N$ is fixed, adding more noise data items leads to higher entropy. Similarly, under a fixed level of noise addition, the bigger the sample size, the higher the entropy value. These observations are consistent with the role of entropy: entropy represents the information content of a dataset, and the entropy after data sanitization should be higher than the entropy before sanitization. Table 8 is further illustrated in Figures 27, 28 and 29, in which each line represents the entropy trend for a specific sample size as the level of noise addition increases.

Figure 27 compares the entropy values for sample sizes $N \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$ against 10 different levels of noise addition, i.e. $N' \in \{N, 2N, \ldots, 10N\}$. Figure 28 does the same for sample sizes $N \in \{10, 100, 1000\}$, and Figure 29 for $N \in \{1000, 10000, 100000\}$. It can be observed from these figures that the rate of increase of entropy is higher when the sample size is smaller.
Figure 27. The entropy value versus the number of noise data items (N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100). (Source: Author’s own)
Figure 28. The entropy value versus the number of noise data items (N = 10, 100, 1000). (Source: Author’s own)
Figure 29. The entropy value versus the number of noise data items (N = 1000, 10000, 100000). (Source: Author’s own)
Table 9. A table of sample size versus the increments of entropy when the level of noise data item addition is increased. (Source: Author’s own)
Table 9 compares the increments of entropy when the number of noise data items is
increased under different sample sizes. Two interesting facts can be observed:
(1) The same level of noise addition leads to a similar increment of entropy value.
This can be observed by reviewing the table column by column.
(2) The increment of entropy decreases as the number of noise data items increases.
This implies that, after a certain threshold value, further increases of noise data
items may not lead to a significant increase in entropy. This observation illustrates
that, for a given dataset, there is a critical value for the number of noise data items
that should be added. Beyond this critical value, any further increases in the noise
data items may only lead to performance degradation rather than privacy level
enhancement. This property is crucial when applying the solution to solving a
real-life problem, for example, to perform the computation under limited
computational power.
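These diminishing returns can be reproduced under a simple model. If, after sanitization, an adversary must treat each of the $N + N'$ data items as an equally likely candidate for a given true item, the entropy is $\log_2(N + N')$ bits, and the per-step increments shrink as $N'$ grows. The sketch below uses this simplified uniform-guess model (an assumption of this sketch, not the thesis's experimental procedure) to illustrate why a critical value for the number of noise items exists.

```python
import math

def entropy_uniform(n_real, n_noise):
    # Entropy (bits) of a uniform guess over n_real + n_noise candidates.
    return math.log2(n_real + n_noise)

N = 100
noise_levels = [k * N for k in range(1, 11)]      # N' = N, 2N, ..., 10N
entropies = [entropy_uniform(N, np_) for np_ in noise_levels]
increments = [b - a for a, b in zip(entropies, entropies[1:])]

# Entropy rises with every extra block of N noise items, but each block
# buys less than the one before it -- the "critical value" effect.
for np_, h in zip(noise_levels, entropies):
    print(f"N'={np_:5d}  h={h:.4f} bits")
```

Past the point where the increments become negligible, further noise only adds computational overhead, which is exactly the trade-off discussed above.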
Privacy Protection Provided by STCP

To analyse the probability for Alice to infer $\{P, Q, R\}$ from $\{DTU3_{g_i}\}_{i=1}^{(n+n')}$, for the sake of explanation we further set $k_7 = l^3$; thus the total number of rows in a $DTU3_{g_i}$ is $3(k_7+1)$. The probability for Alice to infer the correct column in a $DTU3_{g_i}$ is $\frac{1}{(k_7+1)} \cdot \frac{1}{6} = \frac{1}{6(k_7+1)}$; the probability for Alice to infer the correct $\{P, Q, R\}$ from $\{DTU3_{g_i}\}_{i=1}^{(n+n')}$ is $\left(\frac{1}{6(k_7+1)}\right)^{(n+n')}$. This probability is dependent on the sample size $n$, Alice's security parameter $n'$ and $k_7$, where $k_7 = l^3 = k_5^3 n^3$. As the effects of $n$ and $n'$ have been analysed in the previous subsection, here we focus on investigating the effect of $k_7$. The probability for Alice to infer the correct column in each $DTU3_{g_i}$, i.e. $\frac{1}{6(k_7+1)}$, is analysed below.

In the case where Alice does not single out the correct values of $\{P, Q, R\}$, there are two possibilities:

(1) Alice selects the correct combination of $\{P, Q, R\}$ but in an incorrect order. That is, Alice chooses the correct data items from the column she draws from a $DTU3_{g_i}$, but arranges them in the wrong order. For example, assuming that $p_i$, $q_i$ and $r_i$ are chosen, there are six orderings of the three data items, i.e. $\{p_i, q_i, r_i\}$, $\{p_i, r_i, q_i\}$, $\{q_i, p_i, r_i\}$, $\{q_i, r_i, p_i\}$, $\{r_i, p_i, q_i\}$ and $\{r_i, q_i, p_i\}$; only $\{p_i, q_i, r_i\}$ is correct. The probability of getting each of the wrong orderings is $\frac{1}{6(k_7+1)}$.

(2) Alice selects a wrong $\{P, Q, R\}$ combination from the noise data items. There are $6(k_7+1) - 6 = 6k_7$ such combinations, and the probability of each wrong combination is also $\frac{1}{6(k_7+1)}$ (note that $\left(1 - \frac{6}{6(k_7+1)}\right) \cdot \frac{1}{6k_7} = \frac{1}{6(k_7+1)}$).
Assuming that $I_1$ is the event of successfully guessing the correct values of $p_i$, $q_i$ and $r_i$; $I_2, \ldots, I_6$ are the events of (1); and $I_7, \ldots, I_{6(k_7+1)}$ are the events of (2), then $p(I_i) = \frac{1}{6(k_7+1)}$ for $i = 1, 2, \ldots, 6(k_7+1)$, such that $\sum_{i=1}^{6(k_7+1)} p(I_i) = 1$. As $h(I) = -\sum_{i=1}^{6(k_7+1)} p(I_i) \log_2(p(I_i))$, then

$h(I) = -\sum_{i=1}^{6(k_7+1)} p(I_i) \log_2(p(I_i)) = -6(k_7+1) \cdot \frac{1}{6(k_7+1)} \log_2\left(\frac{1}{6(k_7+1)}\right) = \log_2(6(k_7+1)).$
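A closed form of this shape, $h(I) = \log_2(6(k_7+1))$, can be checked directly against a brute-force Shannon-entropy computation over $6(k_7+1)$ equally likely events; a small sketch (the $k_7$ values below are illustrative only):

```python
import math

def h_I(k7):
    # Closed form: h(I) = log2(6 * (k7 + 1)).
    return math.log2(6 * (k7 + 1))

def h_I_direct(k7):
    # Direct Shannon entropy over 6*(k7+1) equally likely events.
    m = 6 * (k7 + 1)
    p = 1.0 / m
    return -sum(p * math.log2(p) for _ in range(m))

for k7 in (10, 100, 1000):
    assert abs(h_I(k7) - h_I_direct(k7)) < 1e-9
    print(k7, round(h_I(k7), 4))
```

Because the closed form is a logarithm, the entropy increments shrink as $k_7$ grows, matching the behaviour reported in Tables 10-12 below.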
The analysis of the entropy values provided by $k_7$ is performed by setting the security parameter $k_7 \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000\}$. The entropy value is calculated by averaging the experimental results from executing the program 100 times.
Tables 10, 11 and 12 show the values of entropy against different values of $k_7$. In this case, $h(I)$ is completely dependent on the value of $k_7$: the value of entropy increases as $k_7$ increases. This result is within our previously discussed expectation. The entropy $h(I)$ also has the same properties as $h_{(N+N', N+N'')}(X)$: (1) it has a similar level of entropy increment for different scales of $k_7$, and (2) the increment of entropy decreases as $k_7$ increases. These two properties can make a significant contribution to finding a practical value of $k_7$ under specific considerations.
Table 10. The entropy value and its increment versus $k_7$ ($k_7$ = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100). (Source: Author's own)

Table 11. The entropy value and its increment versus $k_7$ ($k_7$ = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000). (Source: Author's own)

Table 12. The entropy value and its increment versus $k_7$ ($k_7$ = 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000). (Source: Author's own)

Figure 30. The entropy value versus the value of $k_7$ ($k_7$ = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100). (Source: Author's own)

Figure 31. The entropy value versus the value of $k_7$ ($k_7$ = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000). (Source: Author's own)

Figure 32. The entropy value versus the value of $k_7$ ($k_7$ = 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000). (Source: Author's own)

Figure 33. The entropy value versus the value of $k_7$. (Source: Author's own)
Figures 30, 31 and 32 plot the entropy values versus the values of $k_7$. Figure 33 further summarizes the entropy values as $k_7$ increases from 10 to 10000. The level of increase in entropy decreases as $k_7$ increases. An interesting observation from Figures 30, 31 and 32 is that, although the scales of $k_7$ differ across the three graphs, the three curves show a similar trend.
This subsection has discussed the levels of privacy provided by DOP and STCP. In
DOP, for each dataset X , the level of privacy can be managed by adding a different
number of noise data items. The more noise data items are added, the higher the level
of privacy that can be afforded. A higher level of noise data will undoubtedly lead to
more computational overhead. However, our investigation here has also provided
some insight into the relationship between the effectiveness of the privacy-preserving
capability of the protocol and the number of noise data items used. In cases where the
computation power and computational overhead are of major concern, our protocol
also provides functionality for choosing an affordable number for noise data addition.
An optimal number of noise data items can be determined by evaluating the
increments in entropy versus the number of noise data items used.
The privacy level provided by STCP is dependent on the security parameter $k_5$. This is because $l^3 = k_5^3 n^3$ (i.e. $l = k_5 n$) and each $k_5$ is randomly drawn from a uniform distribution $U(1, 100)$. (Please refer to Figure 22.) The diversity of $k_5$ can be controlled by managing the width of the uniform distribution; for example, using $U(1, 50)$ leads to lower $k_5$ diversity than using $U(1, 100)$. The larger the value of $k_5$, the higher the privacy level that can be achieved by STCP. This property provides an option for choosing an affordable $k_5$; the same option exists in DOP. The privacy level provided by DOP is dependent on the security parameter $k_4$, since $n' = k_4 n$ and each $k_4$ is also randomly drawn from $U(1, 100)$. (Please refer to Figure 21.)
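Under the relations above ($n' = k_4 n$ with $k_4 \sim U(1,100)$, and $l = k_5 n$ with $k_5 \sim U(1,100)$), drawing the security parameters can be sketched as follows. This is a minimal sketch: $U(1, 100)$ is treated as integer-valued here, and the `width` parameter is a hypothetical knob standing in for narrowing the distribution to, e.g., $U(1, 50)$.

```python
import random

def draw_parameters(n, width=100, seed=None):
    # k4, k5 drawn from U(1, width); n' = k4 * n and l = k5 * n,
    # so that l^3 = k5^3 * n^3 as stated in the analysis.
    rng = random.Random(seed)
    k4 = rng.randint(1, width)
    k5 = rng.randint(1, width)
    return {"k4": k4, "k5": k5, "n_prime": k4 * n, "l": k5 * n}

params = draw_parameters(n=30, seed=7)
print(params)
```

Shrinking `width` reduces the diversity of the drawn parameters, and with it the privacy level, in exchange for lower overhead.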
5.4.3.3 Computational Overhead
In a P22NSTP protocol suite execution, $2(l+3)(l+2)(l+1) + 6(n+n') + l + 29$ computational operations are performed, comprising $5(n+n') + 20$ data perturbation operations and $2(l+3)(l+2)(l+1) + (n+n') + l + 9$ algebraic operations (e.g. mapping table generation, index generation and sign test computation). The computation cost is dependent on three factors: the data size $n$, Alice's security parameter $n'$ and Bob's security parameter $l$.

According to our protocol design, $n'$ and $l$ are both much larger than $n$; consequently, a large $n$ leads to large values of $n'$ and $l$, respectively. Owing to this fact, the simulation results could only be acquired for sample sizes $N$ = 10, 20 and 30. Our program cannot execute the simulation when $N \geq 40$, as the data size exceeds the maximum number of elements in a real double array, i.e. $2.8 \times 10^{14}$. The number of computational operations is calculated by averaging the experimental results from executing the program 100 times.

Figure 34 shows the number of computational operations versus the number of noise data items added by Alice and Bob, respectively, for sample sizes $N$ = 10, 20 and 30. The number of computational operations increases exponentially as $n'$ and $l$ increase; the larger the sample size, the faster the rate of increase of the number of computational operations.
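The operation count above can be evaluated directly; the sketch below (with arbitrary illustrative parameter values) confirms that the two sub-counts sum to the total, and shows that the $2(l+3)(l+2)(l+1)$ term dominates as $l$ grows.

```python
def perturbation_ops(n, n_prime):
    # 5(n + n') + 20 data perturbation operations.
    return 5 * (n + n_prime) + 20

def algebraic_ops(n, n_prime, l):
    # 2(l+3)(l+2)(l+1) + (n + n') + l + 9 algebraic operations.
    return 2 * (l + 3) * (l + 2) * (l + 1) + (n + n_prime) + l + 9

def total_ops(n, n_prime, l):
    # 2(l+3)(l+2)(l+1) + 6(n + n') + l + 29 operations in total.
    return 2 * (l + 3) * (l + 2) * (l + 1) + 6 * (n + n_prime) + l + 29

n, n_prime, l = 10, 500, 1000
assert total_ops(n, n_prime, l) == perturbation_ops(n, n_prime) + algebraic_ops(n, n_prime, l)
# The cubic term in l dominates, so the cost grows roughly as l^3.
print(total_ops(n, n_prime, l))
```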
Figure 34. Number of computational operations vs. number of noise data items added by Alice and Bob. (Sample size N = 10, 20, 30.) (Source: Author’s own)
5.4.3.4 Communication Overhead
In total, four messages are generated and transmitted during the protocol suite execution: two by Alice and two by Bob. Assuming that $I$ is the number of bits used to represent the value of each plain-text data item (e.g. input/output data items), the total communication overhead for a P22NSTP protocol suite execution is $\left(2(n+n') + (n+n')^2(l+3)^2 + (l+3)(l+2)(l+1) + 1\right) \cdot I$, where $n$ is the size of both datasets $X$ and $Y$, $n'$ is the security parameter managed by Alice and $l$ is the security parameter managed by Bob.
Figure 35 shows the communication overhead versus the number of noise data items added by Alice and Bob, respectively, for sample sizes $N$ = 10, 20 and 30. The communication overhead is calculated by averaging the experimental results from executing the program 100 times. It increases approximately linearly as $n'$ and $l$ increase; the bigger the sample size, the faster the rate of increase of the communication overhead.
Figure 35. Total communication overhead vs. number of noise data items added by Alice and Bob. (Sample size N = 10, 20, 30.) (Source: Author’s own)
5.4.3.5 Protocol Suite Execution Time
As a data randomization technique is used in the design of the P22NSTP protocol suite, the $n'$ and $l$ generated by each single simulation run will differ. As a result, the execution time is calculated by averaging the experimental results from executing the program 100 times.
Figure 36 shows the execution time versus the number of noise data items added by Alice and Bob, respectively, for sample sizes $N$ = 10, 20 and 30. The execution time increases exponentially as $n'$ and $l$ increase; the bigger the sample size, the faster the rate of increase of the execution time.
Figure 36. Protocol suite execution time vs. number of noise data items added by Alice and Bob. (Sample size N = 10, 20, 30.) (Source: Author’s own)
5.5 Chapter Summary
This chapter has presented the detailed design of the P22NSTP protocol suite. By
specifically designing the local computational tasks and making use of the data
perturbation techniques, the two parties can perform the sign test computation
securely. The privacy-preserving features of the protocol suite have been analysed
based on protocol components. The correctness, the level of privacy provided, the
computational cost, the communication cost and the execution time have also been
theoretically and experimentally analysed. A comparison of TTP-NST, the P22NSTP
and the P22NSTC will be presented in chapter 7.
Chapter 6 A Novel Privacy-preserving Two-party
Nonparametric Sign Test Protocol Suite
Using Cryptographic Primitives (P22NSTC)
6.1 Chapter Introduction
This chapter details the design of the P22NSTC protocol suite. Based on this design,
the security threats against privacy requirements are analysed. The correctness, the
level of privacy it provides, the computational overhead and communication overhead
are also theoretically analysed in this chapter. The overview of the P22NSTC protocol
suite is described in section 6.2. Section 6.3 presents the detailed design, including
computation participants, message objects, components of the P22NSTC protocol suite
and elements of the computation. Finally, section 6.4 describes the operation of this
protocol and discusses the correctness, its privacy performance, computational
overhead and communication overhead.
6.2 Overview of the P22NSTC Protocol Suite
This protocol suite is designed for two parties, Alice and Bob, to collaboratively perform the sign test computation with the assistance of an on-line semi-trusted third party (STTP), while preserving the individual data confidentiality, individual privacy and corporate privacy of $X$ and $Y$, where $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$ and $X$ and $Y$ are vertically partitioned. Based on the design requirements specified in chapter 3, the P22NSTC computation is divided into ten local computation tasks. The protocol suite incorporates two novel protocols that are designed based on data randomization, data swapping and data transformation techniques: the data separation protocol (DSP) and the data randomization protocol (DRP). DSP enables the STTP to randomly split a dataset into two datasets; DRP enables the STTP to generate a randomised dataset based on a given dataset. An additively homomorphic cryptosystem is also used to support the secure computation of the P22NSTC. On the one hand, the additively homomorphic cryptosystem is used by both Alice and Bob to encrypt their datasets before sending them to the STTP. On the other hand, DSP and DRP are used by the
STTP to disguise the intermediate computational results before sending them to Alice
and Bob, respectively. The STTP performs computations on both encrypted and
disguised data. The intermediate computational results generated by the
computational tasks are either sent to the other party or kept for the next local
computation. Eight messages are transmitted between the parties and the STTP
throughout the entire computation process.
Before the protocol execution, Alice and Bob first negotiate an additively homomorphic cryptosystem, i.e. $(pk, sk, E_{pk}(\cdot), D_{sk}(\cdot))$, where $pk$ is the public key, $sk$ is the private key, $E_{pk}(\cdot)$ is the encryption algorithm and $D_{sk}(\cdot)$ is the decryption algorithm. A data transformation function, $O(d)$, is also negotiated, where

$O(d) = \begin{cases} 1, & d > 0 \\ 0, & d = 0 \\ -1, & d < 0. \end{cases}$
Local Computational Task 1

Alice encrypts $X$ and generates $\{E_{pk}(x_i)\}_{i=1}^{n}$.

Local Computational Task 2

Bob encrypts $-Y$ and generates $\{E_{pk}(-y_i)\}_{i=1}^{n}$.

Local Computational Task 3

After receiving $\{E_{pk}(x_i)\}_{i=1}^{n}$ and $\{E_{pk}(-y_i)\}_{i=1}^{n}$, the STTP first computes $e_i = E_{pk}(x_i) \cdot E_{pk}(-y_i) = E_{pk}(x_i - y_i)$, for $i = 1, \ldots, n$. Assuming that $G = \{e_1, \ldots, e_n\}$, the STTP then executes DSP using $G$ as the data input, and generates $G_1$ and $G_2$, where $G_1$ and $G_2$ are two vector datasets, $G_1$ has $n_1$ data items, $G_2$ has $n_2$ data items and $n = n_1 + n_2$. Here we further assume that $G_1 = \{e^1_1, \ldots, e^1_{n_1}\}$ and $G_2 = \{e^2_1, \ldots, e^2_{n_2}\}$.

Local Computational Task 4

Using $G_1$ as the data input, the STTP executes DRP and generates $G''_{T_{G_1}}$, where $T_{G_1}$ is a permutation matrix of dimension $(n_1+n_1') \times (n_1+n_1')$ that is used to distort the order of the data items in $G''_{T_{G_1}}$. Again, using $G_2$ as the data input, the STTP executes DRP and generates $G''_{T_{G_2}}$, where $T_{G_2}$ is a permutation matrix of dimension $(n_2+n_2') \times (n_2+n_2')$ that is used to distort the order of the data items in $G''_{T_{G_2}}$. DRP not only adds $n_1'$ noise data items into $G_1$, but also swaps the data item order of the resulting data vector; similarly, DRP adds $n_2'$ noise data items into $G_2$ and swaps the data item order of the resulting data vector. $G''_{T_{G_1}}$ and $G''_{T_{G_2}}$ are two vector datasets of dimension $1 \times (n_1+n_1')$ and $1 \times (n_2+n_2')$, respectively. Here we assume that $G''_{T_{G_1}} = \{e''^1_1, \ldots, e''^1_{(n_1+n_1')}\}$ and $G''_{T_{G_2}} = \{e''^2_1, \ldots, e''^2_{(n_2+n_2')}\}$. The STTP then sends $G''_{T_{G_1}}$ to Alice and $G''_{T_{G_2}}$ to Bob.
Local Computational Task 5

After receiving $G''_{T_{G_1}}$ from the STTP, Alice uses $D_{sk}(\cdot)$ to decrypt every data item in the dataset. $\{de''^1_i\}_{i=1}^{(n_1+n_1')}$ is then generated, where $de''^1_i$ is the decryption result of the $i$th data item of $G''_{T_{G_1}}$, for $i = 1, \ldots, (n_1+n_1')$.

Local Computational Task 6

By using $O(d)$, Alice transforms $\{de''^1_i\}_{i=1}^{(n_1+n_1')}$ into $\{o^1_i\}_{i=1}^{(n_1+n_1')}$. $o^1_i$ is an integer value, for $i = 1, \ldots, (n_1+n_1')$; according to the transformation function, $o^1_i$ is either 1, 0 or -1. Alice then sends $\{o^1_i\}_{i=1}^{(n_1+n_1')}$ to the STTP.

Local Computational Task 7

After receiving $G''_{T_{G_2}}$ from the STTP, Bob uses $D_{sk}(\cdot)$ to decrypt every data item in the dataset. $\{de''^2_j\}_{j=1}^{(n_2+n_2')}$ is then generated, where $de''^2_j$ is the decryption result of the $j$th data item of $G''_{T_{G_2}}$, for $j = 1, \ldots, (n_2+n_2')$.

Local Computational Task 8

By using $O(d)$, Bob transforms $\{de''^2_j\}_{j=1}^{(n_2+n_2')}$ into $\{o^2_j\}_{j=1}^{(n_2+n_2')}$. $o^2_j$ is an integer value, for $j = 1, \ldots, (n_2+n_2')$; according to the transformation function, $o^2_j$ is either 1, 0 or -1. Bob then sends $\{o^2_j\}_{j=1}^{(n_2+n_2')}$ to the STTP.
Local Computational Task 9

After receiving $\{o^1_i\}_{i=1}^{(n_1+n_1')}$ and $\{o^2_j\}_{j=1}^{(n_2+n_2')}$, the STTP reverses the effects of DRP on them by referring to $T_{G_1}$ and $T_{G_2}$, respectively. $\{ro'^1_i\}_{i=1}^{n_1}$ and $\{ro'^2_j\}_{j=1}^{n_2}$ are then generated. $\{ro'^1_i\}_{i=1}^{n_1}$ is the decrypted and transformed dataset (transformed by $O(d)$) of $\{e^1_i\}_{i=1}^{n_1}$; $\{ro'^2_j\}_{j=1}^{n_2}$ is the decrypted and transformed dataset of $\{e^2_j\}_{j=1}^{n_2}$.

Local Computational Task 10

The STTP computes $\{P, Q, R\}$ using $\{ro'^1_i\}_{i=1}^{n_1}$ and $\{ro'^2_j\}_{j=1}^{n_2}$. The STTP then computes the sign test result using $\{P, Q, R\}$. Finally, the STTP sends the final test result to Alice and Bob, respectively.
Figure 37 provides an overview of the P22NSTC computation.
Figure 37. An overview of the P22NSTC computation.
(Source: Author’s own)
6.3 The Design in Detail
6.3.1 Computation Participants and Message Objects
Computation Participants:
The protocol is designed for two parties, Alice and Bob, who hold datasets $X$ and $Y$ respectively, where $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$ and $X$ and $Y$ are vertically partitioned. An on-line STTP is also used in the protocol suite. The two parties interact with the STTP to carry out the P22NSTC computation.
Message Objects:

Eight messages are transmitted during the entire protocol suite execution:

Message 1: Alice sends the encryption result, i.e. $\{E_{pk}(x_i)\}_{i=1}^{n}$, together with $\alpha$ and $z$, to the STTP. That is, $msg_1 = \{\{E_{pk}(x_i)\}_{i=1}^{n}, \alpha, z\}$. Each $E_{pk}(x_i)$ is a ciphertext. There are $(n+2)$ data items in $msg_1$.

Message 2: Bob sends the encryption result, i.e. $\{E_{pk}(-y_i)\}_{i=1}^{n}$, together with $\alpha$ and $z$, to the STTP. That is, $msg_2 = \{\{E_{pk}(-y_i)\}_{i=1}^{n}, \alpha, z\}$. Each $E_{pk}(-y_i)$ is a ciphertext. There are $(n+2)$ data items in $msg_2$.

Message 3: The STTP sends $G''_{T_{G_1}}$ to Alice. That is, $msg_3 = \{G''_{T_{G_1}}\}$. Assuming that $G''_{T_{G_1}} = \{e''^1_1, \ldots, e''^1_{(n_1+n_1')}\}$, there are $(n_1+n_1')$ data items in $msg_3$.

Message 4: The STTP sends $G''_{T_{G_2}}$ to Bob. That is, $msg_4 = \{G''_{T_{G_2}}\}$. Assuming that $G''_{T_{G_2}} = \{e''^2_1, \ldots, e''^2_{(n_2+n_2')}\}$, there are $(n_2+n_2')$ data items in $msg_4$.

Message 5: Alice sends the transformed decryption result, i.e. $\{o^1_i\}_{i=1}^{(n_1+n_1')}$, to the STTP. That is, $msg_5 = \{\{o^1_i\}_{i=1}^{(n_1+n_1')}\}$. There are $(n_1+n_1')$ data items in $msg_5$.

Message 6: Bob sends the transformed decryption result, i.e. $\{o^2_j\}_{j=1}^{(n_2+n_2')}$, to the STTP. That is, $msg_6 = \{\{o^2_j\}_{j=1}^{(n_2+n_2')}\}$. There are $(n_2+n_2')$ data items in $msg_6$.

Message 7: The STTP sends the final test result, i.e. $FR$, to Alice. That is, $msg_7 = \{FR\}$. There is only one integer-valued data item in $msg_7$.

Message 8: The STTP sends the final test result, i.e. $FR$, to Bob. That is, $msg_8 = \{FR\}$. There is only one integer-valued data item in $msg_8$.
6.3.2 Components of the P22NSTC Protocol Suite

The P22NSTC protocol suite consists of three components: (1) the additively homomorphic cryptosystem, (2) the data separation protocol (DSP) and (3) the data randomization protocol (DRP). The three components are detailed in the following subsections.
6.3.2.1 Additively Homomorphic Encryption Scheme
An additively homomorphic cryptosystem, i.e. $(pk, sk, E_{pk}(\cdot), D_{sk}(\cdot))$, is negotiated by Alice and Bob before the P22NSTC protocol suite execution. It has the properties $E_{pk}(x) \cdot E_{pk}(y) = E_{pk}(x+y)$ and $D_{sk}(E_{pk}(x+y)) = x+y$. It is used by Alice and Bob to encrypt $X$ and $Y$, respectively, before sending them to the STTP. This prevents the individual data confidentiality and individual privacy of the joint dataset from being disclosed to the STTP. Simultaneously, the STTP can still perform a multiplication operation on the encrypted data items, which corresponds to addition of the underlying plaintexts.
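The Paillier cryptosystem (adopted later, in subsection 6.4.3.1) is one instantiation of such a scheme. The toy sketch below uses tiny fixed primes, so it is insecure and for illustration only; it shows the additive property and how a ciphertext product $E_{pk}(x) \cdot E_{pk}(-y)$ decrypts to the difference $x - y$ that the sign test needs.

```python
import random
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def keygen():
    p, q = 293, 433                      # toy primes: insecure, illustration only
    n = p * q
    g = n + 1                            # standard simple choice for g
    lam = lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                 # valid because g = n + 1
    return (n, g), (lam, mu, n)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(sk, c):
    lam, mu, n = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pk, sk = keygen()
n = pk[0]
x, y = 57, 89
# Multiplying ciphertexts adds plaintexts; with -y this yields x - y (mod n).
e = encrypt(pk, x) * encrypt(pk, -y) % (n * n)
d = decrypt(sk, e)
if d > n // 2:                           # map the residue back to a signed value
    d -= n
print(d)                                 # x - y = -32
```

The final sign-fix step mirrors what a party would do before applying the transformation function $O(d)$.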
6.3.2.2 Data Separation Protocol (DSP)
The data separation protocol (DSP) is designed to enable the STTP to randomly separate a dataset $G$ into $G_1$ and $G_2$. Assume that $G = \{d_1, \ldots, d_n\}$, $G_1 = \{d^1_1, \ldots, d^1_{n_1}\}$ and $G_2 = \{d^2_1, \ldots, d^2_{n_2}\}$, where $n = n_1 + n_2$. The detail of DSP is described below.
Data Separation Protocol (DSP)

Input: $G$.
Output: $G_1$ and $G_2$.

(1) STTP randomly selects an integer $n_1$ from $[\frac{n}{3}, \ldots, \frac{2n}{3}]$, each value with probability $\frac{1}{(\frac{2n}{3} - \frac{n}{3})}$.
(2) STTP computes $n_2 = n - n_1$.
(3) STTP randomly separates $G$ into two subsets, $G_1$ and $G_2$, where $G_1$ has $n_1$ data items, $G_2$ has $n_2$ data items, $d^i_j$ equals some $d_k$ for $k \in \{1, \ldots, n\}$, $i \in \{1, 2\}$, and $j \in \{1, \ldots, n_1\}|_{i=1}$ or $j \in \{1, \ldots, n_2\}|_{i=2}$.

Figure 38. The DSP algorithm. (Source: Author's own)
Here $G_1$ has $n_1$ data items, $G_2$ has $n_2$ data items and $n = n_1 + n_2$. If $n_1$ were chosen from $1, \ldots, n$, it might be very small (close to 1, leading to a very large $n_2$) or very large (leading to a very small $n_2$). To avoid such extreme cases, DSP is designed to select $n_1$ from $[\frac{n}{3}, \ldots, \frac{2n}{3}]$.
6.3.2.3 Data Randomization Protocol (DRP)

The data randomization protocol (DRP) is designed to enable the STTP to generate a randomised dataset $G''_{T_G}$ based on a dataset $G$. Assuming that $G = \{d_1, \ldots, d_n\}$, the STTP first generates a noise dataset $G' = \{d'_1, \ldots, d'_{n'}\}$, where $\frac{n^2}{2} \le n' \le n(n-1)$. The STTP then merges $G$ and $G'$. Assuming that $G'' = \{d_1, \ldots, d_n, d'_1, \ldots, d'_{n'}\}$, the STTP finally generates $G''_{T_G}$ by swapping the order of $G''$. The detail of DRP is described below.
Data Randomization Protocol (DRP)

Input: $G$.
Output: $G''_{T_G}$.

(1) STTP randomly selects an integer $n'$ from $[\frac{n^2}{2}, \ldots, n(n-1)]$.
(2) For $i = 1, \ldots, n'$, STTP randomly chooses $t_i$ and $u_i$ from $\{1, \ldots, n\}$, i.e. the STTP has $\{(t_1, u_1), \ldots, (t_{n'}, u_{n'})\}$.
(3) For $i = 1, \ldots, n'$, STTP generates $d'_i = d_{t_i} \cdot d_{u_i}$, i.e. STTP has $\{d'_1, \ldots, d'_{n'}\}$. We further assume that $G' = \{d'_1, \ldots, d'_{n'}\}$.
(4) STTP generates $G''$ by merging $G$ and $G'$ such that $G'' = \{d_1, \ldots, d_n, d'_1, \ldots, d'_{n'}\}$.
(5) STTP generates a permutation matrix $T_G$ of dimension $(n+n') \times (n+n')$.
(6) STTP generates $G''_{T_G}$ by computing $G''_{T_G} = G'' \cdot T_G$.

Figure 39. The DRP algorithm. (Source: Author's own)
The rationales of the key computation steps in Figure 39 are further described below.

The purpose of DRP is to enable the STTP to generate the randomised dataset $G''_{T_G}$ based on a dataset $G$. In the P22NSTC suite execution, DRP will be used by the STTP to generate a randomised dataset based on a ciphertext dataset. For this reason, the noise data items should have the same properties as ciphertexts encrypted by the same cryptosystem. The noise data items are therefore generated by multiplying two of the data items in $G$, i.e. $d'_i = d_{t_i} \cdot d_{u_i}$; thus $d'_i$ has the same properties as $d_{t_i}$ and $d_{u_i}$.

To ensure that sufficient noise data items are supplied, $n'$ is designed to be randomly selected from $[\frac{n^2}{2}, \ldots, n(n-1)]$; thus $n'$ is guaranteed to be at least $\frac{n}{2}$ times larger than $n$. This interval is configurable: by controlling the length of this interval, a required level of security can be achieved.
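A minimal sketch of DRP is given below. Plain integers stand in for ciphertexts (a real run would multiply ciphertexts modulo the Paillier modulus squared), and a permutation list stands in for the matrix $T_G$; `undo_permutation` anticipates the order-restoring step the STTP performs later.

```python
import random

def drp(G, rng=random):
    # Data Randomization Protocol: append n' pairwise-product noise items
    # (n' drawn from [n^2/2, n(n-1)]), then apply a random permutation.
    n = len(G)
    n_prime = rng.randint(n * n // 2, n * (n - 1))
    noise = []
    for _ in range(n_prime):
        t, u = rng.randrange(n), rng.randrange(n)
        noise.append(G[t] * G[u])          # d'_i = d_{t_i} * d_{u_i}
    merged = list(G) + noise               # G''
    perm = list(range(len(merged)))
    rng.shuffle(perm)                      # stands in for T_G
    permuted = [merged[p] for p in perm]   # G'' * T_G
    return permuted, perm

def undo_permutation(permuted, perm):
    # Reverse the swap (multiplication by T_G^T in the thesis notation).
    restored = [None] * len(permuted)
    for pos, src in enumerate(perm):
        restored[src] = permuted[pos]
    return restored

G = [3, 1, 4, 1, 5]
permuted, perm = drp(G)
assert undo_permutation(permuted, perm)[:len(G)] == G   # key items come back first
```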
How $G_1$ and $G_2$ are generated from $G$.

How $G_1$ and $G_2$ are generated plays a pivotal role in the computation, as they are used as the inputs to DRP to generate $G''_{T_{G_1}}$ and $G''_{T_{G_2}}$. There are two design issues associated with this generation. The first issue is how to generate a sufficient number of random noise data items while avoiding the extreme cases where $n_1$ or $n_2$ is close to $n$; in such cases, the number of noise data items that can be generated by DRP from the smaller subset will be extremely limited. Our solution is to set $n_1$ and $n_2$ to values close to $\frac{n}{2}$, i.e. to let $\frac{n}{3} \le n_i \le \frac{2n}{3}$, $i \in \{1, 2\}$. In this way the size of the noise data item pool is larger than $\frac{n}{3}(\frac{n}{3}-1)$ and smaller than $\frac{2n}{3}(\frac{2n}{3}-1)$. The second issue is that the numbers of random noise items that can be generated for $G''_{T_{G_1}}$ and $G''_{T_{G_2}}$ are less than $n_1(n_1-1)$ and $n_2(n_2-1)$, respectively. One way to address this is to generate new noise items using more than two data inputs from $G_1$ or $G_2$ in steps (2) and (3) of DRP. For example, the STTP can first randomly select a number $n_m$ from $\{1, \ldots, n_1\}$ (or $\{1, \ldots, n_2\}$), then randomly pick $n_m$ data items from $G_1$ (or $G_2$), and finally generate the new noise data item by multiplying these $n_m$ data items. In the case where $n_1 = n_2 = \frac{n}{2}$, this increases the size of the selection pool of $G''_{T_{G_1}}$ and $G''_{T_{G_2}}$ from $C^{\frac{n}{2}}_2$ to $C^{\frac{n}{2}}_2 + C^{\frac{n}{2}}_3 + \cdots + C^{\frac{n}{2}}_{\frac{n}{2}-1}$.
How Privacy is preserved.

Upon a protocol suite execution, the probability for a curious party to successfully guess the values of $n_1$ and $n_1'$ (or $n_2$ and $n_2'$) from $G''_{T_{G_1}}$ (or $G''_{T_{G_2}}$) is $\frac{1}{(n_1+n_1'-2)}$ (or $\frac{1}{(n_2+n_2'-2)}$). In the case where a party has successfully inferred $n_1$ and $n_1'$ (or $n_2$ and $n_2'$), it has to further infer $T_{G_1}$ (or $T_{G_2}$) in order to work out the inputs in $G_1$ (or $G_2$), thus disclosing $G$. In such a case, the probability of disclosing the data inputs of $G_1$ (or $G_2$) is $\frac{n_1! \, n_1'!}{(n_1+n_1')! \, (n_1+n_1'-2)}$ (or $\frac{n_2! \, n_2'!}{(n_2+n_2')! \, (n_2+n_2'-2)}$), which is negligible.
6.4 The P22NSTC Protocol Suite and Its Operation
The goal of the P22NSTC protocol is to compute the test result of a sign test computation on $X$ and $Y$ with the assistance of an on-line STTP, while keeping Alice from knowing $Y$ and $\{P, Q, R\}$, keeping Bob from knowing $X$ and $\{P, Q, R\}$, and keeping the STTP from knowing $X$ and $Y$. Prior to executing the computation, Alice and Bob are assumed to have negotiated $\alpha$, $z$ and the use of an additively homomorphic cryptosystem with a key pair $(pk, sk)$, i.e. $(pk, sk, E_{pk}(\cdot), D_{sk}(\cdot))$. They have also jointly defined a function $O(d)$, where

$O(d) = \begin{cases} 1, & d > 0 \\ 0, & d = 0 \\ -1, & d < 0. \end{cases}$

Using the building blocks described in chapter 4 and the components described in section 6.3.2, we can now design the P22NSTC protocol.
6.4.1 The Operation
The P22NSTC protocol suite

Input: $X$ (from Alice) and $Y$ (from Bob).
Output: Alice receives $FR$ from the STTP; Bob receives $FR$ from the STTP. ($FR$ is an integer value indicating either "1": reject $H_0$, or "0": do not reject $H_0$.)

(1) For $i = 1, \ldots, n$ and given $x_i$, Alice computes $E_{pk}(x_i)$.
(2) Alice sends $\{E_{pk}(x_i)\}_{i=1}^{n}$, $\alpha$ and $z$ to the STTP ($msg_1$).
(3) For $i = 1, \ldots, n$ and given $y_i$, Bob computes $E_{pk}(-y_i)$.
(4) Bob sends $\{E_{pk}(-y_i)\}_{i=1}^{n}$, $\alpha$ and $z$ to the STTP ($msg_2$).
(5) For $i = 1, \ldots, n$, the STTP computes $e_i = E_{pk}(x_i) \cdot E_{pk}(-y_i) = E_{pk}(x_i - y_i)$.
(6) Assuming that $G = \{e_1, \ldots, e_n\}$, the STTP executes DSP (input: $G$; output: $G_1$ and $G_2$).
(7) The STTP executes DRP (input: $G_1$; output: $G''_{T_{G_1}}$).
(8) The STTP executes DRP (input: $G_2$; output: $G''_{T_{G_2}}$).
(9) The STTP sends $G''_{T_{G_1}}$ to Alice ($msg_3$).
(10) The STTP sends $G''_{T_{G_2}}$ to Bob ($msg_4$).
(11) Assume that $G''_{T_{G_1}} = \{e''^1_1, \ldots, e''^1_{(n_1+n_1')}\}$. For $i = 1, \ldots, (n_1+n_1')$, Alice decrypts $e''^1_i$ using $D_{sk}(\cdot)$, i.e. Alice computes $de''^1_i = D_{sk}(e''^1_i)$, and generates $\{de''^1_i\}_{i=1}^{(n_1+n_1')}$.
(12) For $i = 1, \ldots, (n_1+n_1')$, Alice calculates $o^1_i = O(de''^1_i)$.
(13) Assume that $G''_{T_{G_2}} = \{e''^2_1, \ldots, e''^2_{(n_2+n_2')}\}$. For $j = 1, \ldots, (n_2+n_2')$, Bob decrypts $e''^2_j$ using $D_{sk}(\cdot)$, i.e. Bob computes $de''^2_j = D_{sk}(e''^2_j)$, and generates $\{de''^2_j\}_{j=1}^{(n_2+n_2')}$.
(14) For $j = 1, \ldots, (n_2+n_2')$, Bob calculates $o^2_j = O(de''^2_j)$.
(15) Alice sends $\{o^1_i\}_{i=1}^{(n_1+n_1')}$ to the STTP ($msg_5$).
(16) Bob sends $\{o^2_j\}_{j=1}^{(n_2+n_2')}$ to the STTP ($msg_6$).
(17) Let $O'_1 = \{o'^1_1, \ldots, o'^1_{(n_1+n_1')}\}$; the STTP computes $O'_{G_1} = O'_1 \cdot T^T_{G_1}$.
(18) Let $O'_{G_1} = \{ro'^1_1, \ldots, ro'^1_{(n_1+n_1')}\}$; the STTP singles out $\{ro'^1_1, \ldots, ro'^1_{n_1}\}$ from $O'_{G_1}$, where $\{ro'^1_1, \ldots, ro'^1_{n_1}\}$ contains the first $n_1$ data items in $O'_{G_1}$.
(19) Let $O'_2 = \{o'^2_1, \ldots, o'^2_{(n_2+n_2')}\}$; the STTP computes $O'_{G_2} = O'_2 \cdot T^T_{G_2}$.
(20) Let $O'_{G_2} = \{ro'^2_1, \ldots, ro'^2_{(n_2+n_2')}\}$; the STTP singles out $\{ro'^2_1, \ldots, ro'^2_{n_2}\}$ from $O'_{G_2}$, where $\{ro'^2_1, \ldots, ro'^2_{n_2}\}$ contains the first $n_2$ data items in $O'_{G_2}$.
(21) The STTP computes $P$ by calculating the frequency of 1 in $\{ro'^1_i\}_{i=1}^{n_1}$ and $\{ro'^2_j\}_{j=1}^{n_2}$.
(22) The STTP computes $Q$ by calculating the frequency of 0 in $\{ro'^1_i\}_{i=1}^{n_1}$ and $\{ro'^2_j\}_{j=1}^{n_2}$.
(23) The STTP computes $R$ by calculating the frequency of -1 in $\{ro'^1_i\}_{i=1}^{n_1}$ and $\{ro'^2_j\}_{j=1}^{n_2}$.
(24) The STTP performs the sign test using $n$, $\alpha$, $z$, $P$, $Q$ and $R$, and generates the final test result $FR$.
(25) The STTP sends $FR$ to Alice ($msg_7$).
(26) The STTP sends $FR$ to Bob ($msg_8$).

Figure 40. The P22NSTC algorithm.
(Source: Author's own)
The rationales of the key computation steps and the formats of the computational elements in Figure 40 are further described below.

In step (5), $e_i$ is a ciphertext generated by computing $E_{pk}(x_i) \cdot E_{pk}(-y_i)$, where $E_{pk}(x_i)$ and $E_{pk}(-y_i)$ are two ciphertexts encrypted by the additively homomorphic cryptosystem $(pk, sk, E_{pk}(\cdot), D_{sk}(\cdot))$, for $i = 1, \ldots, n$.

In step (6), assuming that $G = \{e_1, \ldots, e_n\}$, $G_1$ and $G_2$ are generated by DSP, where $G = G_1 \cup G_2$ and $G_1 \cap G_2 = \emptyset$. We further assume $G_1 = \{e^1_1, \ldots, e^1_{n_1}\}$ and $G_2 = \{e^2_1, \ldots, e^2_{n_2}\}$, where $n = n_1 + n_2$, $e^i_j$ equals some $e_k$ for $k \in \{1, \ldots, n\}$, $i \in \{1, 2\}$, and $j \in \{1, \ldots, n_1\}|_{i=1}$ or $j \in \{1, \ldots, n_2\}|_{i=2}$.

In steps (7) and (9), $G''_{T_{G_1}} = \{e''^1_1, \ldots, e''^1_{(n_1+n_1')}\}$.

In steps (8) and (10), $G''_{T_{G_2}} = \{e''^2_1, \ldots, e''^2_{(n_2+n_2')}\}$.

In step (11), for $i = 1, \ldots, (n_1+n_1')$, $de''^1_i$ is a plaintext.

In step (12), Alice uses the transformation function $O(d)$ to transform $de''^1_i$ into $o^1_i$; the value of $o^1_i$ is either 1, 0 or -1.

In step (13), for $j = 1, \ldots, (n_2+n_2')$, $de''^2_j$ is a plaintext.

In step (14), Bob uses the transformation function $O(d)$ to transform $de''^2_j$ into $o^2_j$; the value of $o^2_j$ is either 1, 0 or -1.

In steps (17)-(18), $O'_1$ is a row vector of dimension $1 \times (n_1+n_1')$ whose data item order has been swapped, and $O'_{G_1}$ is a row vector of dimension $1 \times (n_1+n_1')$ whose data item order has been restored; i.e. $[ro'^1_1, \ldots, ro'^1_{n_1}]$ is the key vector (the decrypted and transformed computational result of $G_1$) and $[ro'^1_{(n_1+1)}, \ldots, ro'^1_{(n_1+n_1')}]$ is the noise vector.

In steps (19)-(20), $O'_2$ is a row vector of dimension $1 \times (n_2+n_2')$ whose data item order has been swapped, and $O'_{G_2}$ is a row vector of dimension $1 \times (n_2+n_2')$ whose data item order has been restored; i.e. $[ro'^2_1, \ldots, ro'^2_{n_2}]$ is the key vector (the decrypted and transformed computational result of $G_2$) and $[ro'^2_{(n_2+1)}, \ldots, ro'^2_{(n_2+n_2')}]$ is the noise vector.
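The restore step in (17)-(20) relies on a permutation matrix being orthogonal, i.e. $T_G^{-1} = T_G^T$, so multiplying the received row vector by $T_G^T$ puts the key items back in front of the noise items. A small sketch with an explicit matrix (the vectors and permutation below are illustrative values only):

```python
def perm_matrix(perm):
    # T[i][j] = 1 iff output position j takes input element i,
    # so the row-vector product v * T applies the permutation.
    m = len(perm)
    T = [[0] * m for _ in range(m)]
    for j, i in enumerate(perm):
        T[i][j] = 1
    return T

def vec_mat(v, T):
    m = len(v)
    return [sum(v[i] * T[i][j] for i in range(m)) for j in range(m)]

def transpose(T):
    return [list(col) for col in zip(*T)]

O = [1, 0, -1, 1, 0]                 # decrypted/transformed vector, in order
perm = [3, 0, 4, 1, 2]               # order imposed by DRP
swapped = vec_mat(O, perm_matrix(perm))              # what Alice/Bob see
restored = vec_mat(swapped, transpose(perm_matrix(perm)))
assert restored == O                                 # T^T undoes T
n1 = 3
key_vector, noise_vector = restored[:n1], restored[n1:]
```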
How privacy-preservation is achieved.

Because $x_i$ and $y_i$ are encrypted before being sent to the STTP, and the STTP does not have access to the key pair $(pk, sk)$, $X$ and $Y$ are kept private from the STTP. For the operations performed by Alice (or by Bob), $G''_{T_{G_1}}$ (or $G''_{T_{G_2}}$) only contains $n_1$ (or $n_2$) data items from $G$, with the rest, i.e. $n_1'$ (or $n_2'$), being noise data items, and $(n_1, n_1')$ (or $(n_2, n_2')$) being known only to the STTP. As shown in subsection 6.3.2.3, the chance for Alice (or Bob) to guess $G$ from $G''_{T_{G_1}}$ (or $G''_{T_{G_2}}$) is negligible. The only possible inference for Alice (or Bob) is to guess a subset of $G$. In Alice's case (Bob's case is identical except that $G''_{T_{G_1}}$ is replaced by $G''_{T_{G_2}}$), for example, there are $\frac{2n}{3} - \frac{n}{3} + 1$ possibilities
regarding the possible data item subsets of $G$: $G''_{T_{G_1}}$ contains $\frac{n}{3}$ data items from $G$, $G''_{T_{G_1}}$ contains $\frac{n}{3}+1$ data items from $G$, ..., or $G''_{T_{G_1}}$ contains $\frac{2n}{3}$ data items from $G$. The probability for Alice to successfully guess these data item subsets is

$\Pr_{AG} = \frac{\sum_{i=n/3}^{2n/3} i! \, (n_1'+i)!}{(n_1+n_1') \, (n_1+n_1')! \, \left(\frac{2n}{3}-\frac{n}{3}+1\right)}.$

The upper bound of $\Pr_{AG}$ is

$\Pr^{up}_{AG} = \frac{\frac{2n}{3}\left(n_1'+\frac{2n}{3}\right)\left(\frac{2n}{3}\right)!}{(n_1+n_1') \, (n_1+n_1')!},$

while the lower bound of $\Pr_{AG}$ is

$\Pr^{low}_{AG} = \frac{\frac{n}{3}\left(n_1'+\frac{n}{3}\right)\left(\frac{n}{3}\right)!}{(n_1+n_1') \, (n_1+n_1')!},$

i.e. $\Pr^{low}_{AG} \le \Pr_{AG} \le \Pr^{up}_{AG}$. The same holds for $\Pr_{BG}$, i.e. $\Pr^{low}_{BG} \le \Pr_{BG} \le \Pr^{up}_{BG}$.
In a case where a malicious Alice (or Bob) sends a fake $X$ (or $Y$) to the STTP, the most optimistic inference she (or he) can draw from $G''_{T_{G_1}}$ (or $G''_{T_{G_2}}$) is partial information about $G$, with probability $\Pr^{up}_{AG}$ (or $\Pr^{up}_{BG}$). If a malicious Alice (or Bob) further decides to send a fake $\{o^1_i\}_{i=1}^{(n_1+n_1')}$ (or $\{o^2_j\}_{j=1}^{(n_2+n_2')}$) to the STTP, she (or he) can only break the computation; no further information will be gained.
6.4.2 Correctness
In steps (7) and (8) of Figure 40, the STTP has employed DRP to transform $G_1$ and $G_2$ into $T_{G_1}G''_1$ and $T_{G_2}G''_2$, respectively. In steps (12) and (14), Alice and Bob further transform $ed''^1_i$ and $ed''^2_j$ using
$O(d) = \begin{cases} 1, & d > 0 \\ 0, & d = 0 \\ -1, & d < 0 \end{cases}$
respectively. By reversing the effects of $T_1$ and $T_2$, $O'_{G_1}$ and $O'_{G_2}$ can be restored. As $O'_{G_1} = \{o'^1_1r'^1_1, \ldots, o'^1_{n_1+n'_1}r'^1_{n_1+n'_1}\}$ and $O'_{G_2} = \{o'^2_1r'^2_1, \ldots, o'^2_{n_2+n'_2}r'^2_{n_2+n'_2}\}$, the subsets $\{o'^1_1r'^1_1, \ldots, o'^1_{n_1}r'^1_{n_1}\}$ and $\{o'^2_1r'^2_1, \ldots, o'^2_{n_2}r'^2_{n_2}\}$ can then be identified. $\{o'^1_1r'^1_1, \ldots, o'^1_{n_1}r'^1_{n_1}\}$ is the transformed comparison result of $(x_i - y_i)$ for $i = 1, \ldots, n_1$; $\{o'^2_1r'^2_1, \ldots, o'^2_{n_2}r'^2_{n_2}\}$ is the transformed comparison result of $(x_i - y_i)$ for $i = n_1+1, \ldots, n$, where $n = n_1 + n_2$. The transformed result of $X - Y$ can thus be acquired. The STTP can then use this to compute $P$, $Q$ and $R$. Therefore the sign test result can be computed correctly.
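The tail end of this argument, reducing each difference to a sign and tallying $P$, $Q$ and $R$, can be sketched in a few lines (Python; the function names are illustrative, and the $-1$ branch of $O(d)$ is a reconstruction from context):

```python
def sign_transform(d):
    # Many-to-one transformation O(d): only the sign of the difference survives,
    # so the actual magnitude of x_i - y_i is never revealed.
    return 1 if d > 0 else (0 if d == 0 else -1)

def sign_test_counts(x, y):
    signs = [sign_transform(xi - yi) for xi, yi in zip(x, y)]
    return signs.count(1), signs.count(-1), signs.count(0)  # P, Q, R

# Differences 2, 0, 3, -2 give two positives, one negative and one tie.
assert sign_test_counts([3, 1, 5, 2], [1, 1, 2, 4]) == (2, 1, 1)
```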
6.4.3 Protocol Analysis
6.4.3.1 Privacy Analysis against Privacy Requirements
The privacy-preserving properties provided by the P22NSTC protocol suite can be classified into three categories:
(1) The privacy protection provided by the additively homomorphic cryptosystem. The cryptosystem is used by both Alice and Bob to secure their respective datasets prior to sending them to the STTP. It protects the individual data confidentiality and individual privacy of X and Y from being known to the STTP. The Paillier cryptosystem is used in the design of the P22NSTC protocol suite. As the Paillier cryptosystem has been proven to be secure in the literature [PAIL'99b], here it is regarded as being irreversible, i.e. an intruder cannot decrypt the encrypted data items.
(2) The privacy protection provided by the DSP and the DRP. The use of DSP and
DRP prevents individual data confidentiality and individual privacy of Y from
being known to Alice, individual data confidentiality and individual privacy of X
from being known to Bob and corporate privacy of },,{ RQP from being known
to Alice and Bob. As the two protocols are designed based on data disguising
techniques, the level of privacy provided is dependent on 1n , 1'n , 2n and 2'n .
(3) The privacy protection provided by the many-to-one transformation function. This function is used by both Alice and Bob to secure the values of the differences of the pairwise comparisons. The function is irreversible: even if the STTP receives the plain transformation results, it cannot infer the actual values of the differences of the pairwise comparisons.
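The additively homomorphic property relied on in point (1) can be illustrated with a toy Paillier implementation (a Python sketch with artificially small primes chosen for readability; a real deployment would use a key of at least 1024 bits, and the thesis prototype uses a 128-bit key programmed in Java):

```python
import random
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def paillier_keygen(p=293, q=433):
    # Toy primes for illustration only.
    n = p * q
    g = n + 1                                  # standard choice of generator
    lam = lcm(p - 1, q - 1)
    # mu = (L(g^lam mod n^2))^-1 mod n, where L(x) = (x - 1) / n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pk, sk = paillier_keygen()
c1, c2 = encrypt(pk, 41), encrypt(pk, 17)
# Additive homomorphism: E(m1) * E(m2) mod n^2 decrypts to m1 + m2.
assert decrypt(pk, sk, (c1 * c2) % (pk[0] ** 2)) == 58
```

It is this product-of-ciphertexts behaviour that allows the STTP to compute on the parties' encrypted data without access to the decryption key.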
As the additively homomorphic cryptosystem and the data transformation technique
are assumed to preserve privacy completely, in this subsection, we focus on analysing
the level of privacy provided by DSP and DRP. This can be further classified into two
aspects: (1) the privacy-preserving effect provided by the DSP and (2) the privacy-
preserving effect provided by the DRP.
6.4.3.2 Quantify Level of Privacy Using Entropy
The case where malicious Alice wants to infer information from $T_{G_1}G''_1$ is analysed in this subsection. (Malicious Bob's case is identical.) The probability for malicious Alice to guess $n_1$ is $\frac{1}{\frac{2n}{3}-\frac{n}{3}+1}$. The probability for malicious Alice to guess $n'_1$ is $\frac{1}{n_1(n_1+1)-\frac{n_1^2}{2}+1}$, and the probability to figure out the permutation matrix $T_{G_1}$ is $\frac{1}{(n_1+n'_1)!}$. The probability for malicious Alice to infer a correct $x_i$ is therefore
$P_{N(n_1),N(n'_1),N(T_{G_1})}(x_i) = \frac{6}{(n+3)(n_1^2+2n_1+2)\,(n_1+n'_1)!}$ for $i = 1, \ldots, n_1$,
and the probability for malicious Alice to infer an incorrect $x_i$ (a noise data item) is
$P_{N(n_1),N(n'_1),N(T_{G_1})}(x_i) = \frac{1}{n'_1}\left(1 - \frac{6n_1}{(n+3)(n_1^2+2n_1+2)\,(n_1+n'_1)!}\right)$ for $i = n_1+1, \ldots, n_1+n'_1$,
such that $\sum_{i=1}^{n_1+n'_1} P_{N(n_1),N(n'_1),N(T_{G_1})}(x_i) = 1$. Thus, writing $p$ for the first probability and $q$ for the second, the entropy value of DRP and DSP is
$h = -\left(n_1\,p\log_2 p + n'_1\,q\log_2 q\right)$.
$n$ is the size of Alice's dataset. $n_1$ is a value randomly drawn from $[\frac{n}{3}, \frac{2n}{3}]$ and managed by the STTP. $n'_1$ is a value randomly drawn from $[\frac{n_1^2}{2}, n_1(n_1+1)]$ and also managed by the STTP.
The analysis of entropy values will be performed by assuming sample sizes N = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000}. According to our protocol design, $n'_1$ (or $n'_2$) is larger than $n$; consequently, a large $n$ leads to large values of $n_1$ (or $n_2$) and $n'_1$ (or $n'_2$), respectively. In addition, the computation of entropy involves the factorial $(n_1+n'_1)!$ and the logarithm $\log_2\left(\frac{6}{(n+3)(n_1^2+2n_1+2)(n_1+n'_1)!}\right)$. This logarithm cannot be calculated by MATLAB when $N \ge 30$: as $\frac{6}{(n+3)(n_1^2+2n_1+2)(n_1+n'_1)!}$ underflows to 0, MATLAB returns the value as NaN (Not-a-Number). As our purpose is to analyse the trend of the entropy value curve, and the value of $\frac{6}{(n+3)(n_1^2+2n_1+2)(n_1+n'_1)!}$ decreases as the sample size increases, we use the value generated from N = 20 in the cases where $N \ge 30$. Since $n_1$, $n_2$, $n'_1$ and $n'_2$ are generated randomly, the entropy values are calculated by averaging the experimental results from executing the program 100 times.
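For reference, the underflow can also be sidestepped entirely by evaluating the factorial in log-space. The sketch below (Python rather than MATLAB; it computes the entropy $h$ of the disguised vector from the probabilities defined in subsection 6.4.3.2, with `math.lgamma` supplying $\ln((n_1+n'_1)!)$ without overflow) evaluates $h$ directly even where the inner fraction underflows to zero:

```python
import math
import random

def entropy_of_disguised_vector(n, seed=0):
    rng = random.Random(seed)
    n1 = rng.randint(n // 3, 2 * n // 3)             # n1 drawn from [n/3, 2n/3]
    n1p = rng.randint(n1 * n1 // 2, n1 * (n1 + 1))   # n'1 from [n1^2/2, n1(n1+1)]
    # log2 of p = 6 / ((n+3)(n1^2+2n1+2)(n1+n'1)!), computed without underflow.
    log2_p = (math.log2(6) - math.log2(n + 3)
              - math.log2(n1 * n1 + 2 * n1 + 2)
              - math.lgamma(n1 + n1p + 1) / math.log(2))
    p = 2.0 ** log2_p            # may underflow to 0.0 for large n -- harmless here
    q = (1.0 - n1 * p) / n1p     # probability assigned to each noise item
    return -(n1 * p * log2_p + n1p * q * math.log2(q))

print(entropy_of_disguised_vector(1000))
```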
Tables 13-16 show the entropy values and the increment of entropy as the sample size changes. Figures 41-44 plot the relationship between the entropy values and the sample sizes, and Figure 45 summarises the overview of entropy versus sample size. It can be observed from Figures 41-44 that, although the scales of the sample size in these four figures differ, they all show a similar trend: the increase in entropy slows as the sample size grows. The highest rate of increase in entropy is observed when the sample size is under 100. When the sample size exceeds 10000, the increase in entropy is more gradual than in the cases where the sample size is smaller than 10000.
Table 13. The entropy value and the increment versus sample size. (Sample size N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.) (Source: Author’s own)
Table 14. The entropy value and the increment versus sample size. (Sample size N = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000.) (Source: Author’s own)
Table 15. The entropy value and the increment versus sample size. (Sample size N = 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000.) (Source: Author’s own)
Table 16. The entropy value and the increment versus sample size. (Sample size N = 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000.) (Source: Author’s own)
Figure 41. The entropy value versus sample size. (Sample size N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.) (Source: Author’s own)
Figure 42. The entropy value versus sample size. (Sample size N = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000.) (Source: Author’s own)
Figure 43. The entropy value versus sample size. (Sample size N = 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000.) (Source: Author’s own)
Figure 44. The entropy value versus sample size. (Sample size N = 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000.) (Source: Author’s own)
Figure 45. The entropy value versus sample size. (overview) (Source: Author’s own)
6.4.3.3 Computational Overhead
In this protocol, $5n+4(n'_1+n'_2)+19$ computations are performed, which include $3(n+n'_1+n'_2)$ cryptographic computations, $6n+2(n'_1+n'_2)$ data disguising computations and $3n+(n'_1+n'_2)$ non-privacy-preserving computations (multiplication, addition and sign test operations). The computation cost increases with the data size $n$ and the security parameters $n'_1$ and $n'_2$.
Owing to the effect of the data randomization technique in DRP, our program cannot execute the simulation when $N > 700$, as the data size exceeds the maximum number of elements in a real double array, i.e. $2.8 \times 10^{14}$. The simulation results are acquired for sample sizes N = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700}. A key length of 128 bits is used in the additively homomorphic cryptosystem. The computational overhead is calculated by averaging the experimental results from executing the program 100 times.
Figure 46. Computational overhead for non-cryptographic operations and cryptographic operations vs. number of noise data items added by the STTP for Alice and Bob, respectively. (Sample size N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700.) (Source: Author’s own)
Figure 47. Computational overhead for non-cryptographic operations and cryptographic operations vs. number of noise data items added by the STTP. (Sample size N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700.) (Source: Author’s own)
Figures 46 and 47 show the relationship between the number of computational operations and the number of noise data items added by the STTP for Alice and Bob, respectively. The number of computational operations increases approximately linearly as $n'_1$ and $n'_2$ increase. The number of non-cryptographic operations is larger than the number of cryptographic operations.
6.4.3.4 Communication Overhead
Eight messages are exchanged during the entire protocol suite execution, with four of
them being transmitted by the STTP, and two by Alice and Bob, respectively.
Assuming that $I$ is the number of bits used to represent the value of each plaintext data item (e.g. input/output data items) and $I'$ is the number of bits needed to represent the value of each ciphertext data item, the communication overhead for the P22NSTC protocol suite is $(2n+n'_1+n'_2)I+(3n+n'_1+n'_2+4)I'$ bits, where $n$ is the size of $X$ and $Y$ and $n_1$, $n'_1$, $n_2$, $n'_2$ are security parameters randomly chosen by the STTP.
Figures 48 and 49 show the relationship between the communication overhead and the number of noise data items added by the STTP for Alice and Bob, respectively.
The amount of communication overhead is shown in two aspects: the number of non-
encrypted data items and the number of encrypted data items. The communication
overhead is calculated by averaging the experimental results from executing the
program 100 times. The communication overhead increases approximately linearly
when the number of noise data items increases. An observation can be made that the number of non-encrypted data items and the number of encrypted data items are similar. This echoes the theoretical result, i.e. $(2n+n'_1+n'_2)I+(3n+n'_1+n'_2+4)I'$. As $n'_1+n'_2 > n$, this value is largely dependent on $n'_1+n'_2$.
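A small sketch (Python; the bit widths $I$ and $I'$ and all parameter values are illustrative, and the item counts, $2n+n'_1+n'_2$ plaintext and $3n+n'_1+n'_2+4$ encrypted, are the reconstructed totals) makes the similarity of the two counts concrete:

```python
def p22nstc_comm_bits(n, n1p, n2p, I, Ip):
    # Plaintext items: 2n + n'1 + n'2; encrypted items: 3n + n'1 + n'2 + 4.
    plaintext_bits = (2 * n + n1p + n2p) * I
    encrypted_bits = (3 * n + n1p + n2p + 4) * Ip
    return plaintext_bits, encrypted_bits

# With unit bit widths the two item counts differ only by n + 4, so once
# n'1 + n'2 dominates, the plaintext and ciphertext curves track each other.
plain, enc = p22nstc_comm_bits(n=100, n1p=5000, n2p=5000, I=1, Ip=1)
assert enc - plain == 104
```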
Figure 48. Communication overhead for non-encrypted data items and encrypted data items vs. number of noise data items added by the STTP for Alice and Bob, respectively. (Sample size N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700.) (Source: Author’s own)
Figure 49. Communication overhead for non-encrypted data items and encrypted data items vs. number of noise data items added by the STTP. (Sample size N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700.) (Source: Author’s own)
6.4.3.5 Protocol Suite Execution Time
Figure 50. Protocol suite execution time vs. number of noise data items added by the STTP for Alice and Bob. (Sample size N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700.) (Source: Author’s own)
Figure 51. Protocol suite execution time vs. number of noise data items added by the STTP. (Sample size N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700.) (Source: Author’s own)
Figures 50 and 51 show the relationship between the execution time and the number of noise data items added by the STTP for Alice and Bob, respectively. The execution time is calculated by averaging the experimental results from executing the program 100 times. The execution time increases exponentially as $n'_1$ and $n'_2$ increase: the more noise data items there are, the faster the execution time grows.
6.5 Chapter Summary
This chapter has presented the detailed design of the P22NSTC protocol suite. By
specifically designing the local computational tasks and making use of the data
perturbation techniques and the homomorphic cryptosystem, with the assistance from
the on-line STTP, the two parties can securely perform the sign test computation. The privacy-preserving features of the protocol suite have been analysed for each computational task on a task-by-task basis. With the assistance of the STTP, this protocol suite prevents a party from knowing the other party's dataset and the intermediate computational results of the computation. It also prevents the STTP from knowing the parties' private datasets. The correctness, the level of privacy provided, the computational cost, the communication cost and the execution time have also been theoretically and experimentally analysed. A comparison of the TTP-NST, the P22NSTP
and the P22NSTC will be presented in chapter 7.
Chapter 7 A Comparison of the TTP-NST, the P22NSTP
and the P22NSTC
7.1 Chapter Introduction
This chapter compares the performances of the P22NSTP and the P22NSTC against the TTP-NST. The TTP-NST, the P22NSTP and the P22NSTC algorithms have all been prototyped and implemented using MATLAB. As MATLAB provides a sign test computation function in its statistical computation library, the TTP-NST program uses the provided function to carry out the sign test computation. The P22NSTP protocol suite is a novel design; its algorithm is programmed entirely in MATLAB. However, since MATLAB does not provide a cryptosystem library, the homomorphic cryptosystem used in the novel P22NSTC protocol suite is programmed in Java as a function. The P22NSTC protocol suite then employs this function during the
protocol suite execution. Section 7.2 presents the comparison of privacy protections
provided by the three algorithms. The comparisons of computational overhead and
communication overhead are presented in sections 7.3 and 7.4, respectively. Section 7.5
compares the execution times and finally section 7.6 summarises this chapter.
7.2 A Comparison of Privacy Protection
Figure 52. A comparison of privacy protection by the TTP-NST, the P22NSTP and the P22NSTC. (Source: Author's own)
Figure 52 summarises the privacy protection provided by the TTP-NST and the
designed solutions. It is compared against individual data confidentiality, individual
privacy and corporate privacy, each with three privacy requirements.
In the TTP-NST model, the TTP does all the computation for Alice and Bob. Alice
has no access to Y (i.e. L12 is completely protected.) and Bob has no access to X
(i.e. L11 is completely protected.); after the completion of the computation, all they
know are their own private data input and the final computation result (i.e. L21, L22, L31 and L32 are all completely protected.). In this model, the TTP knows all the information about X and Y (i.e. L13, L23 and L33 are not protected.), except for the
identities of the data subject in X and Y .
In the P22NSTP computation, no third party is involved (i.e. L13, L23 and L33 are
completely protected.); the computation is carried out by Alice and Bob jointly. It is
noted that Y is completely kept away from Alice (i.e. L12 is completely protected.) as
Alice has no access to Y during the P22NSTP execution. All other privacy protection
requirements, i.e. L11, L21, L22, L31 and L32 are protected by data perturbation
techniques. More specifically, L11 and L21 are protected by DOP; L22 is protected
by STCP; L31 and L32 are protected by DOP and STCP jointly. The parties may infer
information from the intermediate results they have received, but this can only be
achieved with great difficulty.
In the P22NSTC computation, an on-line semi-trusted third party (STTP) is employed.
An additively homomorphic cryptosystem is also used by both Alice and Bob to
prevent the STTP from knowing X and Y (i.e. L13 and L23 are protected by
cryptosystem.). As both X and Y are encrypted and sent to the STTP for the first
stage computation, Alice and Bob have no access to Y and X , respectively (i.e. L11
and L12 are completely protected). After the first stage computation, the STTP
employs DSP and DRP to protect the intermediate computation results before sending
them to Alice and Bob. Both L21 and L22 are protected by data perturbation
techniques. The final sign test computation is computed by the STTP, thus Alice and
Bob have no access to },,{ RQP (i.e. L31 and L32 are completely protected.). It is
noted that the STTP is allowed to know },,{ RQP (i.e. L33 is not protected) in order
to perform the sign test computation. This is a compromise to provide better
protection for individual data confidentiality and individual privacy against
participating parties.
7.3 A Comparison of Computational Overhead
Figure 53. A comparison of computation overhead. (Source: Author’s own)
As mentioned in chapters 5 and 6, the total number of computations needed for the P22NSTP computation is $2(l+3)(l+2)(l+1)+6(n+n')l+29$, including $5(n+n')+20$ non-privacy-preserving computations and $2(l+3)(l+2)(l+1)+(n+n')l+9$ data disguising computations. The computational complexity of the P22NSTP is $O(n^3)$. The total number of computations needed for the P22NSTC is $5n+4(n'_1+n'_2)+19$. This includes $3(n+n'_1+n'_2)$ cryptographic computations, $6n+2(n'_1+n'_2)$ data disguising computations and $3n+(n'_1+n'_2)$ non-privacy-preserving computations (multiplication, addition and sign test operations). The computational complexity of the P22NSTC is $O(n)$. $(n+4)$ non-privacy-preserving computations are performed in the TTP-NST; the computational complexity of the TTP-NST is $O(n)$. It can be seen from Figure 53 that the computational complexity of the P22NSTC is smaller than that of the P22NSTP. This is because of the noise data items generated in the P22NSTP.
7.4 A Comparison of Communication Overhead
Figure 54. A comparison of communication overhead (Source: Author’s own)
Both the TTP-NST and the P22NSTP use four messages throughout the computation, while the P22NSTC uses eight messages. The extra four messages are used by Alice and Bob to interact with the STTP. Assuming that $I$ is the number of bits used to represent the value of each plaintext data item (e.g. input/output data items) and $I'$ is the number of bits needed to represent the value of each ciphertext data item, the TTP-NST consumes $4nI$ bits of communication cost and the P22NSTP consumes $\left(2(n+n')^2+(n+n')^2(l+3)+(l+3)(l+2)(l+1)+1\right)I$ bits, both for plaintext data items. The P22NSTC consumes $(2n+n'_1+n'_2)I+(3n+n'_1+n'_2+4)I'$ bits of communication cost, of which $(2n+n'_1+n'_2)I$ bits are for plaintext data items and $(3n+n'_1+n'_2+4)I'$ bits are for encrypted data items.
7.5 A Comparison of Execution Time
Table 17. The execution time of the TTP-NST, the P22NSTP and the P22NSTC (sec). (Source: Author's own)
Figure 55. A comparison of execution time for the TTP-NST, the P22NSTP and the P22NSTC (sec). (Source: Author's own)
Table 17 and Figure 55 compare the execution times for the TTP-NST, the P22NSTP and the P22NSTC. According to the graph, the performance of the P22NSTP is much lower than that of the P22NSTC. This is because the additively homomorphic cryptosystem used in the P22NSTC is a compiled Java program, while the P22NSTP is programmed entirely in MATLAB. As MATLAB is a high-level programming language, the MATLAB code needs to be executed on its platform at every run of the program, whereas the function provided by compiled Java code is much more efficient than the MATLAB program. The execution time of the P22NSTC increases exponentially as the sample size increases.
Table 18. A table of protocol efficiency for the TTP-NST, the P22NSTP and the P22NSTC. (Source: Author's own)
Figure 56. A comparison of protocol efficiency for the TTP-NST, the P22NSTP and the P22NSTC. (Source: Author's own)
Table 18 and Figure 56 compare the protocol efficiencies for the TTP-NST, the P22NSTP and the P22NSTC. The protocol efficiency is calculated by the equation defined in section 3.5.4, i.e. $P.E. = \frac{\text{Protocol Execution Time}}{\text{Size of the dataset}}$. The $P.E.$ value represents the average computation time for a single data input under a given security level; the lower the $P.E.$ value, the better the efficiency. According to the figure, the P22NSTP is the least efficient one. The efficiency of the P22NSTC decreases when the sample size increases.
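The efficiency measure itself is trivial to reproduce (a Python sketch; the timing and size values are placeholders, not the measured results from Table 18):

```python
def protocol_efficiency(execution_time_sec, dataset_size):
    # P.E. = protocol execution time / size of the dataset (section 3.5.4);
    # a lower value means less average time per data input, i.e. better efficiency.
    return execution_time_sec / dataset_size

# Example: 12 s on 600 items is 0.02 s per data input.
assert protocol_efficiency(12.0, 600) == 0.02
```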
7.6 Further Discussions
In terms of computational overhead, cryptographic primitives are normally
computationally expensive, as they involve modular exponentiation computations.
Data perturbation techniques, however, are normally more computationally efficient,
as they only involve simple algebraic operation. For this reason, it was expected that
the P22NSTP protocol suite would outperform the P22NSTC protocol suite. However, the experimental results do not match this expectation.
The computation of the P22NSTP involves a large number of noise data additions and
matrix multiplication operations that are contributed by the noise data items. Although
the algebraic and swap operations are simpler operations than the
encryption/decryption operations, the noise data items cause too many additional
operations. The computational complexity of the P22NSTP is $O(n^3)$. In our
experiment, the number of data items exceeds the number that the MATLAB system
can process. The P22NSTC, although it employs the Paillier cryptosystem, requires much less computation time than the P22NSTP; its computational complexity is $O(n)$. As MATLAB does not provide a cryptographic library, the Paillier cryptosystem is programmed in Java and employed as a function in the P22NSTC. This can be regarded as calling a cryptosystem function library from MATLAB, since libraries written in Java, ActiveX or .NET can be called directly from the MATLAB command interface [MATL'12]. In comparison, the P22NSTP protocol suite is programmed entirely in MATLAB, while the P22NSTC protocol suite is mainly programmed in MATLAB with one exception: the Paillier cryptosystem, which is programmed in Java. The implication of using this Java-programmed Paillier cryptosystem for the experimental results is minimal, as both MATLAB and Java are high-level programming languages.
Through this study, we can make the following observations and suggestions. Being
restricted to the nature of the sign test, the values of data inputs should not be altered
before the pairwised comparison. The noise data item addition and an additively
homomorphic cryptosystem are used to address this restriction. The noise data item
addition adds additional costs to the computational effort. The majority of the
computation tasks in the P22NSTP are on the computation of noise data items.
Therefore, if the computation involves operating on individual data items, using data perturbation techniques is less efficient and cryptographic primitives should be used. For computation on aggregated data, for example the mean value and variance, as the main computation in relation to individual data items has already been fulfilled and the volume of data items is not large (compared with the computation on all individual data items), applying data perturbation techniques to the aggregated data would be more efficient.
In a scenario where a third party is not available, the P22NSTP is the only solution. A
way to improve its efficiency is to lower the degrees of the security parameters, i.e. to select $k_1$, $k_2$, $k_3$, $k_4$, $n'$ and $l$ from a smaller interval, e.g. $U(1,10)$ or $U(1,5)$. However, this will make it easier to compromise the corporate privacy, i.e. the distribution properties of X and Y can be guessed more easily. If a third party is available in a computation, the P22NSTC is the preferable method. The efficiency of the P22NSTC can be managed by controlling the key length of the additively homomorphic encryption scheme and the number of noise data items added.
For real world applications, the P22NSTC can be implemented in a cloud computing
environment. The role of the STTP can be played by a trusted entity in the cloud; the
parties can submit their dataset to this entity anytime and anywhere. The sign test
algorithm in this privacy-preserving computation algorithm can be replaced by other
statistical algorithms if the cloud is to support other types of computational tasks. In
such cases, the data partitioning model and the prefix of the computation will need to
be altered accordingly. The number of participants can also be extended, not restricted
to only two participants. More detailed discussions on possible extensions of this
research are given in the next chapter.
7.7 Chapter Summary
This chapter has compared the two protocol suites presented in this thesis, i.e. the
P22NSTP and the P
22NSTC, with the TTP-NST model. The comparisons of level of
security, computational overhead, communication overhead and protocol execution
time have been conducted. The protocol efficiency has also been compared using the
average computation time for a single data input under different sample sizes. The
result of our theoretical analysis anticipated that the P22NSTP would be more efficient than the P22NSTC; however, there is a discrepancy between the comparison results and our theoretical expectation. Theoretically, a protocol that employs data perturbation techniques is more computationally efficient than a protocol employing cryptographic primitives. The reason for this comparison result is that the two prototypes were not implemented to the same standard. As the Paillier cryptosystem was programmed and compiled in Java, this largely improved the efficiency of the P22NSTC. However, the P22NSTC is still far less efficient than the TTP-NST, since the TTP-NST does not provide any privacy-preserving properties.
The next chapter concludes this thesis and gives recommendation for future work.
Chapter 8 Conclusion and Future Work
This chapter summarizes the work presented in this thesis, provides the conclusions
drawn from the research findings, and finally recommends the direction for future
work.
8.1 Thesis Summary
8.1.1 Review of the Thesis
The work presented in this thesis can be arranged into four parts: the research background, the design of the two-party solution, i.e. the P22NSTP protocol suite, the design of the on-line STTP solution, i.e. the P22NSTC protocol suite, and the implementation and evaluation of both solutions.
Research Background
Chapter 1 explained how distributed data computation can benefit from employing
privacy-preserving techniques. The privacy concerns in distributed statistical
computation problems have also been outlined in this chapter. Chapter 2 provided a
set of definitions and terminologies that are commonly used in the literature. An
extensive review of related works has also been presented, which outlines what
approaches and privacy-preserving techniques have been employed in specific
research problems and how existing works have attempted to tackle security
threats/risks. Chapter 3 provided the design preliminaries for the work presented in
this thesis. It firstly defined the privacy definition to be used in this research. The
decomposition and analysis of the two-party nonparametric sign test (NST)
computation were further presented. The privacy considerations against security
threats to local and global computational tasks were then specified and the design
requirements that can be used in the design of our solution were then extracted. Based
on the design requirements, chapter 4 described a set of privacy-preserving building
blocks that were used in the design of our solutions. Both data perturbation techniques
and a cryptographic primitive were used in this research. The data perturbation
techniques included data swapping, data randomization and data transformation
techniques. The cryptographic primitive included an additively homomorphic
cryptosystem, more specifically the Paillier cryptosystem, which was used in this
research work.
The Two-party Solution
Chapter 5 presented the detailed design of our novel two-party privacy-preserving
solution that employed data perturbation techniques, i.e. the P22NSTP protocol suite.
This solution achieved privacy-preserving computation without the need to employ
computationally expensive cryptographic primitives or a third party. To clearly
identify the security threats, the entire computation was decomposed into four local
computational tasks and each local computational task fulfilled one computation step.
To support this computation, five novel protocols have been designed based on data swapping, data randomization and data transformation techniques. They were the random probability density function generation function (RpdfGP), the data obscuring protocol (DOP), the secure two-party comparison protocol (STCP), the data extraction protocol (DEP) and the permutation reverse protocol (PRP). The RpdfGP was used by the DOP and the STCP as a function to generate a randomised probability density function (pdf) during their execution. The second to the fifth protocols each
accomplished one computation step. These protocols enabled Alice and Bob to
cooperatively and securely conduct the local computational tasks in turns. The level
of privacy provided, the computational overhead and the communication overhead
were theoretically and experimentally analysed.
The On-line STTP Solution
Chapter 6 presented the detailed design of our novel privacy-preserving solution that
employs an on-line STTP and an additively homomorphic cryptosystem, i.e. the
P22NSTC protocol suite. It provided a more secure solution to the research problem,
but is much more theoretically computationally expensive. Two novel protocols have
been designed based on data swapping, data randomization and data transformation
techniques. They were dataset split protocol (DSP) and dataset randomization
protocol (DRP). An additively homomorphic cryptosystem was also used to support
the secure computation of the P22NSTC. While the additively homomorphic
cryptosystem enabled both Alice and Bob to encrypt their datasets before sending
them to the STTP, the DSP and the DRP enabled the STTP to disguise the
intermediate computational results before sending them to Alice and Bob,
respectively. The additively homomorphic property enabled the STTP to perform
computations on both encrypted and disguised data throughout the computation. The
two parties only interacted with the STTP, and did not communicate with each other
directly. The level of privacy provided, the computational overhead and the
communication overhead were theoretically and experimentally analysed.
Implementation and Evaluation
The protocols have been implemented and evaluated against an ideal-TTP model
(TTP-NST) using MATLAB in chapter 7.
1. The prototype of the TTP-NST utilised the sign test function provided by
MATLAB.
2. The P22NSTP protocol suite was a novel design; it was completely
programmed by MATLAB code.
3. The Paillier cryptosystem used in the P22NSTC was programmed and
compiled in JAVA as a function. The P22NSTC then employed this
function during the protocol suite execution.
The protocol suites have been evaluated theoretically and experimentally.
1. In the theoretical analysis, the protocols suites have been analysed using
mathematical equations and probability theory.
2. The security performance was analysed using probability functions, while the
computational and communication overheads have been calculated using
mathematical equations.
3. In the experimental analysis, the protocol execution times of the three models
were compared under different data sizes. More specifically, N = {10, 20, 30, 40,
50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000} was used
in the analysis.
8.1.2 Contributions
Four significant contributions to the knowledge area have been made by this research
work:
1. The two-party computation solution, i.e. the P22NSTP protocol suite, enables two
parties to securely perform the sign test computation on their joint dataset
without the use of any third party. In addition to satisfying all the specified
design requirements, the P22NSTP protocol suite places less computational
overhead on both parties than the P22NSTC protocol suite. Its communication overhead is also smaller than that of the P22NSTC protocol suite, as fewer messages are needed and no encrypted data items are involved. The P22NSTP protocol suite is
more suited to circumstances where a third party is not available.
2. The on-line STTP solution, i.e. the P22NSTC protocol suite, enables two parties
to securely perform the sign test on their joint dataset with the assistance of a
third party. The use of the additively homomorphic cryptosystem enables this alternative to achieve a more secure solution than the P22NSTP protocol suite, at the cost of increased computational and communication overheads. The STTP bears the majority of the computational load during the execution of this protocol suite. This solution is suited to circumstances where a third party is available.
3. Both protocol suites have been implemented and evaluated using MATLAB. To
the best of the author’s knowledge, this is the first systematic work to address the
statistical hypothesis problem using the privacy-preserving distributed
computation method. Although there is a discrepancy between the experimental
results and the theoretical results, the reasons have been clearly identified.
4. The development of a four-phase methodology for transforming a normal statistical algorithm into its privacy-preserving distributed counterpart. The four phases are (1) data privacy definition, (2) statistical algorithm decomposition, (3) solution design and (4) solution implementation. The designs of the two protocol suites have demonstrated that the methodology is practical and feasible, and the research findings from this work can contribute to the transformation of other privacy-preserving distributed statistical computation (PPDSC) problems.
To summarize, the research aim has been achieved: to explore the usage of privacy-preserving distributed statistical computation, to investigate current solutions in the related work, to design solutions to the privacy-preserving two-party nonparametric sign test computation, and to demonstrate a systematic methodology for transforming a normal statistical algorithm into its privacy-preserving counterpart.
8.2 Future Work
We have the following recommendations for future work:
The Improvement of the Current Solutions
There is a discrepancy between the experimental results and the theoretical analysis.
This can be attributed to two main reasons:
1. The prototypes were not programmed to the same standard. The Paillier homomorphic cryptosystem had already been compiled into an executable Java function, the TTP-NST algorithm is a function provided by MATLAB, while the P22NSTP was implemented entirely in MATLAB. A fairer basis for comparing the three models is worth further investigation; it could be provided if the Paillier cryptosystem and the sign test computation algorithm were both programmed in MATLAB. The experience gained in this process will be useful and applicable to other PPDSC problems using this methodology.
2. The designs of the two protocol suites rely largely on data perturbation techniques. Consequently, a great number of noise data items were generated during the experiments. In most cases, the number of noise data items exceeded the capacity of the test environment, so the experiments could not gather sufficient results. This problem can be investigated further from two aspects: (1) refining the algorithms and (2) refining the implementation. The designs of the two protocol suites each make use of a set of security parameters, and the data randomization technique was the key factor determining the amount of noise data generated. The objective is to restrict each parameter with a specific rule while achieving an acceptable level of privacy; once this objective has been fulfilled, the prototypes can be refined with more confidence.
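The trade-off between the number of noise data items and the level of privacy can be sketched with a toy additive data-perturbation example. This is not the protocol suites' actual disguising procedure; `disguise`, `noise_count` and `noise_range` are hypothetical names chosen for illustration.

```python
import random

def disguise(values, noise_count, noise_range=1000):
    """Hide real data items among randomly generated noise items.
    A larger noise_count raises the privacy level, but also the
    storage and communication cost, which is the trade-off observed
    in the experiments."""
    noise = [random.randint(-noise_range, noise_range) for _ in range(noise_count)]
    disguised = values + noise
    random.shuffle(disguised)
    return disguised

real = [12, -3, 7, 0, 25]
disguised = disguise(real, noise_count=20)
# The receiver sees 25 shuffled items, only 5 of which are genuine,
# so any single item is genuine with probability 5/25.
print(len(disguised))  # -> 25
```

Restricting `noise_count` (and the other security parameters) with explicit rules, as suggested above, would bound the data volume while preserving an acceptable privacy level.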
Upon the completion of the above work, the experience and lessons learnt can be used in applications to other PPDSC problems. The following are some directions for this application:
Other Nonparametric and Parametric Hypothesis Test Problems
In this thesis, the sign test problem, an elementary nonparametric hypothesis test problem, is studied as our initial work because it is a fundamental element of nonparametric hypothesis testing. A number of other nonparametric hypothesis techniques are used in a variety of research domains but have not yet been transformed into their privacy-preserving counterparts. The study of the transformation of these techniques would be an interesting and challenging research direction. In addition to the nonparametric hypothesis test problems, parametric hypothesis test problems are applicable in many more research areas, but require more statistical assumptions before the computation can be performed. One of our next stages of work will be the investigation and analysis of the design requirements for parametric hypothesis problems.
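For reference, the underlying (non-private) sign test that the protocol suites compute securely can be sketched as follows; `sign_test` is an illustrative helper, not the MATLAB function used in the thesis, and the sample data are invented.

```python
from math import comb

def sign_test(x, y):
    """Plain two-sided paired sign test.
    Ties (x_i == y_i) are discarded, as is standard."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)          # number of positive signs
    # Exact two-sided p-value under Binomial(n, 0.5).
    tail = min(k, n - k)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return k, n, min(p, 1.0)

x = [85, 70, 60, 90, 75, 80, 65, 95]
y = [80, 72, 58, 85, 70, 78, 66, 90]
k, n, p = sign_test(x, y)   # k = 6 positive signs out of n = 8
```

A privacy-preserving counterpart must compute the sign count `k` and the p-value without either party revealing its column of observations, which is exactly the problem the P22NSTP and P22NSTC protocol suites address.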
Factor Analysis Problems
In many research contexts, multiple population means and related statistics may need to be analysed simultaneously. The populations to be compared correspond to the values of one or more independent variables (factors) which may or may not affect the response variable under investigation. The factor analysis technique addresses this kind of problem by investigating the relationship between a dependent variable and one or more independent variables. The computation involves the calculation of sums of squares, mean squares and degrees of freedom. The global computation of this sort of algorithm differs greatly from other statistical algorithms because of the simultaneous multiple comparisons and the degrees of freedom involved. We believe that developing a secure solution to this problem is a challenge that can help bridge the gap in privacy consideration in many research areas.
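The sums of squares, mean squares and degrees of freedom mentioned above can be illustrated with a plain one-way analysis sketch. This is an ordinary, non-private computation; `one_way_anova` is a hypothetical helper name and the data are invented.

```python
def one_way_anova(groups):
    """Between/within sums of squares, degrees of freedom and the F ratio
    for a one-way layout: the building blocks a secure factor-analysis
    protocol would need to compute jointly."""
    all_obs = [x for g in groups for x in g]
    N, k = len(all_obs), len(groups)
    grand_mean = sum(all_obs) / N
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, N - k
    F = (ss_between / df_between) / (ss_within / df_within)
    return F, (ss_between, df_between), (ss_within, df_within)

F, (ssb, dfb), (ssw, dfw) = one_way_anova([[3, 2, 1], [5, 3, 4], [5, 6, 7]])
```

The difficulty in the distributed setting is that the grand mean and both sums of squares aggregate every party's observations, so all of these intermediate quantities must be computed without disclosure.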
Nonlinear Regression Problem
Nonlinear regression is another form of regression analysis, in which observational data are modelled by a nonlinear function that depends on one or more independent variables. Consequently, the best fit for the observations is a curve rather than a straight line. One of the most popular techniques for addressing this problem is to apply a transformation to the original observational data so that the relationship between them becomes linear. However, the problem becomes critical when multiple parties join the computation and all of them would like to keep their data as secure as possible (to disclose as little information about their data as possible). This brings a further challenge when developing its privacy-preserving counterpart.
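The linearising-transformation technique described above can be sketched for the exponential model y = a*exp(b*x): taking logarithms turns it into a linear model, which ordinary least squares then fits. This is an illustrative, non-private computation; `fit_exponential` is a hypothetical helper name.

```python
from math import exp, log

def fit_exponential(xs, ys):
    """Fit y = a * exp(b*x) via the linearising transform
    ln(y) = ln(a) + b*x, then ordinary least squares on (x, ln(y))."""
    n = len(xs)
    zs = [log(y) for y in ys]                 # transformed responses
    mx, mz = sum(xs) / n, sum(zs) / n
    b = (sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
         / sum((x - mx) ** 2 for x in xs))
    a = exp(mz - b * mx)
    return a, b

# Noise-free data generated from y = 2 * exp(0.5 * x); the transform
# recovers the parameters exactly (up to floating-point error).
xs = [0, 1, 2, 3, 4]
ys = [2 * exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
```

In the multi-party setting, even this simple pipeline requires the logarithmic transform and the least-squares sums to be evaluated over data the parties refuse to pool, which is what makes the privacy-preserving version challenging.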
The Optimum Methodology for the Privacy-preserving Distributed Statistical
Computation
The ultimate goal of this research is to construct an optimum methodology for transforming normal statistical algorithms into their privacy-preserving counterparts. Once the limitations of the current solutions have been addressed and the gaps left by undeveloped solutions have been filled, this goal can be achieved.
189
References…………
ABAD’90 M. Abadi and J. Feigenbaum, (1990), “Secure Circuit Evaluation – A
Protocol Based on Hiding Information From an Oracle”, Journal of
Crytology: Volume 2(1), Springer-Verlag New York, pages 1-12.
ABAD’02 M. Abadi, N. Glew, B. Horne and B. Pinkas, (2002), “Certified Email
with a Light On-line Trusted Third Party: Design and Implementation”,
the Proceedings of the 11th
International Conference on World Wide
Web (WWW ‘02), pages 387-395, Honolulu, Hawaii, USA, 7-11 May
2002, ACM New York Press.
AGGA’08a C. C. Aggarwal and P. S. Yu, (2008), “Privacy-preserving Data Mining –
Models and Algorithms”, Advances in Database Systems: Volume 34,
Springer Science+Business Media, LLC.
AGGA’08b C. C. Aggarwal, (2008), “Privacy and the Dimensionality Curse”,
Privacy-preserving Data Mining – Models and Algorithms: Chapter 18,
Advances in Database Systems: Volume 34, Springer Science+Business
Media, LLC, pages 433-460.
AGRA’00 R. Agrawal and R. Srikant, (2000), “Privacy-preserving data Mining”, the
Proceedings of the 2000 ACM SIGMOD International Conference On
Management of Data (SIGMOD’00), pages 439-450, Dallas, Texas,
USA, 14-19 May 2000, ACM New York Press.
AGRA’01 D. Agrawal and C. C. Aggarwal, (2001), “On the Design and
Quantification of Privacy-preserving Data Mining Algorithms”, the
Proceedings of the 20th
ACM SIGMOD-SIGACT-SIGART Symposium on
Principles of Database Systems (PODS’01), pages 247-255, 21-23 May
2001, Santa Barbara, California, USA, ACM New York Press.
AGRA’03 R. Agrawal, A. Evimievski and R. Srikant, (2003), “Information Sharing
Across Private Databases”, the Proceedings of the 2003 ACM SIGMOD
International Conference on Management of Data (SIGMOD’03), pages
86-97, San Diego, California, USA, 9-12 June 2003, ACM New York
Press.
AGRA’04 R. Agrawal, J. Kiernan, R. Srikant and Y. Xu, (2004), “Order
Preserving Encryption for Numeric Data”, the Proceedings of the 2004
ACM SIGMOD International Conference on Management of Data
(SIGMOD’04), pages 563-574, Paris, France, 13-18 June 2004, ACM
New York Press.
AKIN’09 Mufutau Akinwande, (2009), “Advances in Homomorphic
Cryptosystems”, Journal of Universal Computer Science, Volume 15, No
3, Pages 506-522, J.UCS Press.
190
ATAL’01 M. J. Atallah and W. Du, (2001), “Secure Multi-Party Computational
Geometry”, the Proceedings of the 7th
International Workshop on
Algorithms and Data Structures (WADS 2001), Lecture Notes in
Computer Science: Volume 2125, pages 165-179, Providence, Rhode
Island, USA, 8-10 August 2001, Springer-Verlag, New York,.
ATAL’03 M. J. Atallah, F. Kerschbaum and W. Du, (2003), “Secure and Private
Sequence Comparisons”, the Proceedings of the 2003 ACM Workshop
on Privacy in the Electronic Society (WPES’03), pages 39-44,
Washington, DC, USA, 30 October, 2003, ACM New York Press,.
AUMA’07 Y. Aumann and Y. Lindell, (2007), “Security Against Covert
Adversaries: Efficient Protocols for Realistic Adversaries”, the
Proceedings of the 4th
Theory of Cryptography Conference (TCC 2007),
Lecture Notes in Computer Science: Volume 4392, pages 137-156,
Amsterdam, The Netherlands, 21-24 February 2007, Springer-Verlag
Berlin, Heidelberg.
BARA’83 I. Bárány and Z. Füredi, (1983), “Mental Poker with Three or More
Players”, Journal of Information and Control: Volume 59(1-3),
Academic Press Professional, Inc. San Diego, CA, USA, pages 84-93.
BEAV’90a D. Beaver, S. Micali and P. Rogaway, (1990), “The Round Complexity
of Secure Protocols”, the Proceedings of the 22nd
Annual ACM
Symposium on Theory of Computing, pages 503-513, Baltimore, MD,
USA, 13-17 May 1990, ACM New York Press.
BEAV’90b D. Beaver and J. Feigenbaum, (1990), “Hiding Instances in Multioracle
Queries”, the Proceedings of the 7th
Annual Symposium on Theoretical
Aspects of Computer Science (STACS 90), Lecture Notes in Computer
Science: Volume: 415, pages 37-48, Rouen, France, 22-24 February
1990, Springer-Verlag Berlin, Heidelberg.
BEAV’97 D. Beaver, (1997), “Commodity-based Cryptography (Extended Abstract)”,
the Proceedings of the 29th
Annual ACM Symposium on Theory of
Computing (STOC '97), pages 446-455, El Paso, Texas, USA, 4-6 May
1997, ACM New York Press.
BEAV’98 D. Beaver, (1998), “Server-assisted Cryptography”, the Proceedings of
the 1998 New Security Paradigms Workshop, pages 92-106,
Charlottesville, Virginia, USA, 22-25 September 1998, ACM New York
Press.
BEIM’04 A. Beimel and T. Malkin, (2004), “A Quantitative Approach to
Reductions in Secure Computation”, the Proceedings of the 1st Theory of
Cryptography Conference (TCC 2004), Lecture Notes in Computer
Science: Volume 2951, pages 238-257, Cambridge, Massachusetts, USA,
19-21 February 2004, Springer-Verlag Berlin, Heidelberg.
BENA’94 J. Benaloh, (1994), “Dense Probabilistic Encryption”, the Proceedings of
the Workshop on Selected Areas in Cryptography (SAC’94), pages 120-
128, Kingston, Ontario, Canada.
191
BEND’08 A. Ben-David, N. Nisan and B. Pinkas, (2008), “FairplayMP - A System
for Secure Multi-Party Computation”, the Proceedings of the 15th
ACM
Conference on Computer and Communications Security (CCS’08), pages
257-266, Alexandria, Virginia, USA, 27-31 October 2008, ACM New
York Press.
BENO’85 M. Ben-Or and N. Linial, (1985), “Collective Coin Flipping, Robust
Voting Schemes and Minima of Banzhaf Values”, the Proceedings of the
26th
Annual Symposium on Foundations of Computer Science (SFCS’85),
pages 408-416, Portland, Oregon, USA, 21-23 October 1985, IEEE
Computer Society Press.
BENO’88 M. Ben-Or, S. Goldwasser and A. Wigderson, (1988), “Completeness
Theorems for Non-Cryptographic Fault-Tolerant Distributed
Computation”, the Proceedings of the 20th
Annual ACM Symposium on
Theory of Computing (STOC’88), pages 1-10, Dallas, Texas, USA, 23-
26 May 1998, ACM New York Press.
BERT’05 E. Bertino, I. N. Fovino and L. P. Provenza, (2005), “A Framework for
Evaluating Privacy-preserving Data Mining Algorithms”, Journal of
Data Mining and Knowledge Discovery, Volume 11(2), pages 121-154,
Kluwer Academic Publishers Hingham, MA, USA.
BONE’07 D. Boneh, E. Kushilevitz, R. Ostrovsky, W. E. Skeith, (2007), “Public
Key Encryption That Allows PIR Queries”, the Proceedings of the 27th
Annual International Cryptology Conference on Advances in Cryptology
(CRYPTO 2007), Lecture Notes in Computer Science: Volume 4622,
pages 50-67, Santa Barbara, California, USA, 19-23 August 2007,
Springer-Verlag Berlin, Heidelberg.
CACH’99 C. Cachin, (1999), “Efficient Private Bidding and Auctions with An
Oblivious Third Party”, the Proceedings of the 6th
ACM Conference on
Computer and Communications Security (CCS '99), pages 120-127,
Singapore, 1-4 November 1999, ACM New York Press.
CACH’00 C. Cachin and J. Camenisch, (2000), “Optimistic Fair Secure
Computation (Extended Abstract)”, the Proceedings of the 20th
Annual
International Cryptology Conference on Advances in Cryptology
(CRYPTO’00), pages 94-115, Santa Barbara, California, USA, 20-24
August 2000, Springer-Verlag London, UK.
CANE’01 R. Canetti, (2001), “Universally Composable Security: A New Paradigm
for Cryptographic Protocols”, the Proceedings of the 42nd
IEEE
Symposium on Foundation of Computer Science (FOCs’01), pages 136-
145, Las Vegas, Nevada, USA, 14-17 October 2001, IEEE Computer
Society Press.
CANE’01 R. Canetti, Y. Ishai, R. Kumar, M. K. Reiter, R. Rubinfeld and R. N.
Wright, (2001), “Selective Private Function Evaluation with
Applications to Private Statistics”, the Proceedings of the 20th
Annual
ACM Symposium on Principles of Distributed Computing (PODC2001),
192
pages 293-304, Newport, Rhode Island, USA, 26-29 August 2001, ACM
New York Press.
CANO’10 I. Cano and S. Ladra and V. Torra, (2010), “Evaluation of Information
Loss for Privacy-preserving Data Mining through Comparison of Fuzzy
Partitions”, the Proceedings of the 2010 IEEE International Conference
on Fuzzy Systems (FUZZ), pages 1-8, Barcelona, Spain, 18-23 July 2010 ,
IEEE Computer Society Press.
CAST’04 J. Castella-Roca and J. Domingo-Ferrer, (2004), “On the Security of An
Efficient TTP-Free Mental Poker Protocol”, the Proceedings of the
International Conference on Information Technology: Coding and
Computing 2004 (ITCC 2004): Volume 2, pages 781-784, Las Vegas,
USA, 5-7 April 2004, IEEE Computer Society Press.
CAST’10 J. Castro, (2010), “Statistical Disclosure Control in Tabular Data”,
Chapter 6 of Privacy and Anonymity in information Management
Systems, Advanced Information and Knowledge Processing, pages 113-
131, Springer-Verlag London, ISBN 978-1-84996-237-7.
CATA’01 Dario Catalano, Rosario Gennaro, Nick Howgrave-Graham and Phong Q.
Nguyen, “Paillier’s Cryptosystem Revisited”, the Proceedings of the 8th
ACM conference on Computer and Communications Security (CCS’01),
pages 206-214, Philadelphia, Pennsylvania, USA, 6-8 November, 2001,
ACM NewYork Press.
CHAU’88 D. Chaum, C. Crépeau and I. Damgård, (1988), “Multiparty
Unconditionally Secure Protocols (Extended Abstract)”, the Proceedings
of the 20th
Annual ACM Symposium on Theory of Computing (STOC’88),
pages 11-19, Chicago, Illinois, USA, 2-4 May 1988, ACM New York
Press.
CHOR’95 B. Chor, O. Goldreich, E. Kushilevitz and M. Sudan, (1995), “Private
Information Retrieval”, the Proceedings of the 36th
Annual Foundations
of Computer Science, pages 41-50, Milwaukee, Wisconsin, USA, 23-25
October 1995, ACM New York Press.
CHOR’98 B. Chor, O. Goldreich, E. Kushilevitz and M. Sudan, (1995), “Private
Information Retrieval”, the Journal of the ACM: Volume 45(6), pages
965-982, November 1998, IEEE Computer Society Press.
CLIF’02a C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin and M. Y. Zhu, (2002),
“Tools for Privacy-preserving Distributed Data Mining”, ACM SIGKDD
Explorations Newsletter: Volume 4(2), pages 28-34, ACM New York
Press,.
CLIF’02b C. Clifton, M. Kantarcioglu and J. Vaidya, (2002), “Defining Privacy for
Data Mining”, the Proceedings of the National Science Foundation
Workshop on Next Generation Data Mining, pages 126-133, Baltimore,
Maryland, USA, 1-3 November 2002, AAAI/MIT Press,.
193
DAMG’03 I. Damgård and J. B. Nielsen, (2003), “Universally Composable Efficient
Multiparty Computation from Threshold Homomorphic Encryption”, the
Proceedings of the 23rd
Annual International Cryptology Conference on
Advances in Cryptology (CRYPTO 2003), Lecture Notes in Computer
Science: Volume 2729, pages 247-264, Santa Barbara, California, USA,
17-21 August 2003, Springer-Verlag Berlin, Heidelberg.
DICR’98 G. Di-Crescenzo, Y. Ishai and R. Ostrovsky, (1998), “Universal Service-
Providers for Database Private Information Retrieval (Extended
Abstract)”, the Proceedings of the 7th
Annual ACM Symposium on
Principles of Distributed Computing (PODC’98), pages 91-100, Puerto
Vallarta, Mexico, 28 June - 2 July 1998, ACM New York Press,.
DINU’03 I. Dinur and K. Nissim, (2003), “Revealing Information While
Preserving Privacy”, the Proceedings of the 22nd
ACM SIGMOD-
SIGACT-SIGART Symposium on Principles of Database Systems
(PODS’03), pages 202-210, San Diego, California, USA, 9-12 June 2003,
ACM New York Press.
DOLE’91 D. Dolev, C. Dwork and M. Naor, (1991), “Non-malleable
Cryptography”, the Proceedings of the 23rd
Annual ACM Symposium on
Theory of Computing (STOC’91), pages 542-552, New Orleans, Los
Angles, 5-8 May 1991, USA, ACM New York Press.
DOLE’03 D. Dolev, C. Dwork and M. Naor, (2003), “Nonmalleable
Cryptography”, SIAM Review: Volume 45(4), pages 727-784, Society
for Industrial and Applied Mathematics.
DOMI’01 J. Domingo-Ferrer, (2001), “A Quantitative Comparison of Disclosure
Control Method for Microdata”, Chapter 6 of Confidentiality, Disclosure,
and Data Access: Theory and Practical Applications for Statistical
Agencies, pages 111-133, Elsevier Press.
DOMI’07 J. Domingo-Ferrer, (2007), “A Three-Dimensional Conceptual
Framework for Database Privacy”, the Proceedings of the 4th
VLDB
Conference on Secure Data Management (SDM’07), pages 193-202,
Vienna, Austria, 23-24 September 2007, Springer-Verlag Berlin
Heidelberg.
DOMI’08 J. Domingo-Ferrer and M. Brad-Amoros, (2008), “Peer to Peer
Information Retrieval”, the Proceedings of the international conference
on Privacy in Statistical Databases (PSD’08), Lecture Notes in
Computer Science, Volume 5262, pages 315-323, Istanbul, Turkey, 24-
26 September 2008, Springer-Verlag Berlin, Heidelberg.
DOMI’09a J. Domingo-Ferrer, M. Maria Bras-Amorós , Q. Wu and J. Manjón,
(2009), “User-private Information Retrieval Based on a Peer-to-peer
Community”, Journal of Data & Knowledge Engineering: Volume
68(11), pages 1237-1252, June 2009, Elsevier Science Publishers B. V.
Amsterdam.
194
DOMI’09b J. Domingo-Ferrer, A. Solanas and J. Castella-Roca, (2009), “h(k)-
Private Information Retrieval from Privacy-uncooperative Queryable
Databases”, Journal of Online Information Review: Volume 33(4),
August 2009, Emerald Group Publishing Limited, Available on line at
http://www.deepdyve.com/lp/emerald-publishing/h-k-private-
information-retrieval-from-privacy-uncooperative-queryable-
1vpJ0E2SBB, last access May 2011.
DU’01a W. Du and M. J. Atallah, (2001), “Privacy-Preserving Cooperative
Scientific Computations”, the Proceedings of the 14th
IEEE Computer
Security Foundations Workshop (CSFW’01), pages 273-282, Cape
Breton, Nova Scotia, 11-13 June 2001, IEEE Computer Society Press.
DU’01b W. Du, (2001), “A Study of Several Specific Secure Two-party
Computation Problems”, A PhD thesis, Computer Science, Purdue
University.
DU’01c W. Du and M. J. Atallah, (2001), “Secure Multi-Party Computation
Problems and their Applications: A Review and Open Problems”, the
Proceedings of the New Security Paradigms Workshop 2001 (NSPW’01),
pages 11-20, Cloudcroft, New Mexico, USA, 10-13 September 2001,
ACM New York Press.
DU’01d W. Du and M. J. Atallah, (2001), “Privacy-Preserving Cooperative
Statistical Analysis”, the Proceedings of the Annual Computer Security
Applications Conference (ASAC’01), pages 102-110, New Orleans,
Louisiana, USA, 11-14 December 2001, IEEE Computer Society Press.
DU’01e W. Du and M. J. Atallah, (2001), “Protocols for Secure Remote Database
Access with Approximate Matching”, Tech Report Number CERIAS
Tech Report 2001-02, CERIAS, Purdue University, available online
from https://www.cerias.purdue.edu/assets/pdf/bibtex_archive/2001-
02.pdf , 24 pages, last access May 2011.
DU’02 W. Du and Z. Zhan, (2002), “A Practical Approach to Solve Secure
Multi-party Computation Problems”, the Proceedings of the New
Security Paradigms Workshop (NSPW’02), pages 127-135, Virginia
Beach, Virginia, USA, 23-26 September 2002, ACM New York Press.
DU’04 W. Du, Y. S. Han and S. Chen, (2004), “Privacy-preserving Multivariate
statistical Analysis: Linear Regression and Classification”, the
Proceedings of the 4th
SIAM International Conference on Data Mining,
pages 222-233, Lake Buena Vista, Florida, USA, 22-24 April 2004,
SIAM Press.
DUAN’07 Y. Duan, (2007), “P4P: A Practical Framework for Privacy-preserving
Distributed Computation”, A PhD thesis, Computer Science, University
of California, Berkeley.
ELLI’05 M. Elliot, (2005), “Statistical Disclosure Control”, the Proceedings of the
RSS Social Statistics Committee Conference on Linking Survey and
195
Administrative Data and Statistical Disclosure Control, pages 663-670,
London, UK, 16 Feb 2005, Elsevier Inc Press.
EMEK’06 F. Emekci, D. Agrawal, A. E. Abbadi and A. Gulbeden, (2006),
“Privacy-preserving Query Processing Using Third Parties”, the
Proceedings of the 22nd
International Conference on Data Engineering
(ICDE’06), pages 27-36, Atlanta, Georgia, USA, 3-8 April 2006, IEEE
Computer Society Press.
EURO’95 European Communities, (1995), “Directive 95/46/ec of the European
parliament and of the Council of 24 October 1995 on the protection of
individuals with regard to the processing of personal data and on the free
movement of such data”, Official Journal of the European Communities,
No I.(281):31–50.
EVFI’02 A. Evfimievski, (2002), “Randomization in Privacy-preserving Data
Mining”, ACM SIGKDD Explorations Newsletter: Volume 4(2), pages
43-48, ACM New York Press.
EVFI’03 A. Evfimievski, J. Gehrke and R. Srikant, (2003), “Limiting Privacy
Breaches in Privacy-preserving Data Mining”, the Proceedings of the
22nd
ACM SIGMOD-SIGACT-SIGART Symposium on Principles of
Database Systems (PODS’03), pages 211-222, San Diego, California,
USA, 9-12 June 2003, ACM New York Press.
FAYY’10 E. Fayyoumi and B. J. Oommen, (2010), “A Survey on Statistical
Disclosure Control and Micro-aggregation Techniques for Secure
Statistical Databases”, Journal of Software - Practice & Experience -
Focus on Selected PhD Literature Reviews in the Practical Aspects of
Software Technology, Volume 40(12), pages 1161-1188, November
2010, John Wiley & Sons, Inc. New York, NY, USA.
FEIG’94 U. Feige, J. Killian and M. Naor, (1994), “A Minimal Model for Secure
Computation (Extend Abstract)”, the Proceedings of the 26th
Annual
ACM Symposium on the Theory of Computing, pages 554-563, Montréal
Québec, Canada, 23-25 May 1994, ACM New York Press.
FEIG’06 J. Feigenbaum, Y. Ishai, T. Malkin, K. Nissim, M. J. Strauss and R. N.
Wright, (2006), “Secure Multiparty Computation of Approximations”,
ACM Transactions on Algorithm (TALG): Volume 2(3), pages 435-472,
ACM New York Press.
FERP’74 The U.S. Code, Title 20, Chapter 31, Subchapter III, Part 4, § 1232g,
“Family Educational and Privacy Rights”, Available on the U.S.
Government Printing Office Website:
http://frwebgate.access.gpo.gov/cgi-
bin/getdoc.cgi?dbname=browse_usc&docid=Cite:+20USC1232 and
Cornell University Law School, Legal Information Institute Website:
http://www.law.cornell.edu/uscode/html/uscode20/usc_sec_20_0000123
2---g000-.html, last access May 2011.
196
FIEN’07 S. E. Fienberg, A. F. Karr, Y. Nardi and A. B. Slavkovic, (2007), “Secure
Logistic Regression with Multi-Party Distributed Databases”, the
Proceedings of the Survey Research Methods Section, pages 3506-3513,
American Statistical Association, available on line:
http://www.amstat.org/sections/srms/proceedings/y2007/files/jsm2007-
000848.pdf , last access May 2011.
FRAN’97 M. K. Franklin and M. K. Reiter, (1997), “Fair Exchange with a Semi-
Trusted Third Party (extended abstract)”, the Proceedings of the 4th
ACM
Conference on Computer and Communications Security (CCS '97),
pages 1-5, Zurich, Switzerland, 1-4 April 1997, ACM New York Press.
FREE’04 M. Freedman, K. Nissim and B. Pinkas, (2004), “Efficient Private
Matching and Set Intersection”, Advances in Cryptology (EUROCRYPT
2004), Lecture Notes in Computer Science: Volume 3027, pages 1-19,
Interlaken, Switzerland, 2-6 May 2004, Springer-Verlag Berlin,
Heidelberg.
GENT’09 C. Gentry, (2009), “Fully Homomorphic Encryption Using Ideal
Lattices”, the Proceedings of the 41st Annual ACM Symposium on
Theory of Computing (STOC’09), pages 169-178, Bethesda, Maryland,
USA, 31 May – 2 June 2009, ACM New York Press.
GEOT’04 B. Goethals, S. Laur, H. Lipmaa and T. Mielikäinen, (2004), “On
Private Scalar Product Computation for Privacy-preserving Data
Mining”, the Proceedings of the 7th
Annual International Conference in
Information Security and Cryptology, Lecture Notes in Computer
Science: Volume 3506, pages 104-120, Seoul, South Korea, 2-3
December 2004, Springer-Verlag Berlin, Heidelberg.
GION’07 A. Gionis, H. Mannila, T. Mielikäinen and P. Tsaparas, (2007),
“Assessing data mining Results via Swap Randomization”, Journal of
ACM Transactions on Knowledge Discovery from Data (TKDD):
Volume 1(3) article 14, pages 14:1-14:32, ACM New York Press.
GOLD’82 S. Goldwasser and S. Micali, (1982), “Probabilistic Encryption & How to
Play Mental Poker Keeping Secret All Partial Information”, the
Proceedings of the 14th
Annual ACM Symposium on Theory of
Computing (STOC’82), pages 365-377, San Francisco, California, USA,
5-7 May 1982, ACM New York Press.
GOLD’84 O. Goldreich, (1984), “On Concurrent Identification Protocols”, the
Proceedings of the EUROCRYPT’84 Workshop on Advances in
Cryptology: Theory and Application of Cryptographic Techniques, pages
387-396, Paris, France, 9-11 April 1984, Springer-Verlag Berlin,
Heidelberg.
GOLD’87 O. Goldreich, S. Micali and A. Wigderson, (1987), “How to play any
mental game”, the Proceedings of the 19th
Annual ACM Symposium on
Theory of Computing, pages 218-229, New York, New York, USA, 25-
27 May 1987, ACM New York Press.
197
GOLD’89 S. Goldwasser, S. Macali and C. Rackoff, (1989), “The Knowledge
Complexity of Interactive Proof System”, SIAM Journal on Computing:
Volume 18(1), pages 186-208, Society for Industrial and Applied
Mathematics Philadelphia Press.
GOLD’97 S. Goldwasser, (1997), “Multi-party computations: Past and present”,
Invited paper to the Proceedings of the 16th
Annual ACM Symposium on
Principles of Distributed Computing (PODC’97), pages 1-6, Santa
Barbara, California, 21-24 Aug 1997, ACM Press.
GOLD’98 O. Goldreich, (1998), “Secure Multi-party computation (working draft)”,
available online: http://www.wisdom.weizmann.ac.il/~oded/pp.html, last
access May 2011.
GOLD’01 O. Goldreich, (2004), “Foundations of Cryptography: Volume 1 - Basic
Techniques”, Cambridge University Press, Date of Publication: June
2001, ISBN 0-521-79172-3.
GOLD’04 O. Goldreich, (2004), “Foundations of Cryptography: Volume 2 - Basic
Applications”, Cambridge University Press, Date of Publication: May
2004, ISBN 0-521-83084-2.
GRIN’97 C. M. Grinstead and J. L. Snell, (1997), “Introduction to Probability”,
American Mathematical Society Press, available online:
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probab
ility_book/amsbook.mac.pdf, last access May 2011.
HUND’10 A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, R. Lenz, J.
Naylor, E. C. Nordholt, G. Seri and P. D. Wolf, (2010), “Handbook on
Statistical Disclosure Control, version 1.2”, ESSNet SDC, January 2010,
available on line: http://neon.vb.cbs.nl/casc/SDC_Handbook.pdf, last
access May 2011.
HURK’98 C. A. J. Hurkens and S. R. Tiourine, (1998), “Models and Methods for
the Microdata Protection Problem”, Journal of Official Statistics:
Volume 14(4), Issue 4, pages 437-447, December 1998, Statistics
Sweden Press.
IOAN’03 I. Ioannidis and A. Grama, (2003), “An Efficient Protocol for Yao's
Millionaires' Problem”, the Proceedings of the 36th
Annual Hawaii
International Conference on System Sciences (HICSS’03), 6 pages
(abstract on page 205), Hilton Waikoloa Village, Hawaii, USA, 6-9
January 2003, IEEE Computer Society Press.
ISHA’07 Y. Ishai and A. Paskin, (2007), “Evaluating Branching Programs on
Encrypted Data”, the Proceedings of the 4th
Conference on Theory of
Cryptography (TCC’07), Lecture Notes in Computer Science: Volume
4392, pages 575-594, Amsterdam, The Netherlands, 21-24 February
2007, Springer-Verlag Berlin, Heidelberg.
JAGA’06 G. Jagannathan and R. N. Wright, (2006), “Privacy-Preserving Data
Imputation”, the Proceedings of the 6th
IEEE International Conference
198
on Data Mining - Workshops (ICDMW '06), pages 535-540, Hong Kong,
China, 18-22 December 2006, IEEE Computer Society Press.
KANT’02 M. Kantarcioglu and J. Vaidya, (2002), “An Architecture for Privacy-
preserving Mining of Client Information”, the Proceedings of the IEEE
International Conference on Privacy, Security and Data Mining
(PSDM2002), Volume 14, pages 37-42, Maebashi, Japan, 9-12
December 2002, Australian Computer Society Press.
KANT’03 M. Kantarcioglu and C. Clifton, (2003), “Assuring Privacy When Big
Brother is Watching”, the Proceedings of the 8th
ACM SIGMOD
Workshop on Research Issues in Data Mining and Knowledge Discovery
(DMKD’03), pages 88-93, San Diego, California, USA, 13 June 2003,
ACM New York Press.
Karg’03 H. kargupta, S. Datta, Q. Wang and K. Sivakumar, (2003), “On the
Privacy-preserving Properties of Random Data Perturbation”, the
Proceedings of the 3rd
IEEE International Conference on Data Mining
(ICDM’03), pages 99-106, Melbourne, Florida, USA, 19-22 November
2003, IEEE Computer Society Washington, DC, USA.
KARR’04 A. F. Karr, X. Lin, A. P. Sanil and J. P. Reiter, (2004), “Regression on
Distributed Databases via Secure Multi-Party Computation”, the
Proceedings of the 2004 Annual National Conference on Digital
Government Research (dg.o2004), 2 pages , Seattle, Washington, USA,
24-26 May 2004, Digital Government Society of North America.
KARR’05a A. F. Karr, X. Lin, A. P. Sanil and J. P. Reiter, (2005), “Secure
Regression on Distributed Databases”, Journal of Computational and
Graphical Statistics: Volume 14(2), pages 263-279, American Statistical
Association Press.
KARR’05b A. F. Karr, J. Feng, X. Lin, J. P. Reiter, A. P. Sanil and S. S. Young,
(2005), “Secure Analysis of Distributed Chemical Databases without
Data Integration”, Journal of Computer-aided Molecular Design:
Volume 19 (9-10), pages 739-747, September 2005, Springer.
KARR’06 A. F. Karr, X. Lin, A. P. Sanil and J. P. Reiter, (2006), “Secure Statistical
Analysis of Distributed Databases”, Statistical Methods in
Counterterrorism - Game Theory, Modelling, Syndromic Surveillance,
and Biometric Authentication, pages 237-262, Springer
Science+Business Media, LLC.
KARR’09a A. F. Karr, X. Lin, A. P. Sanil and J. P. Reiter, (2009), “Privacy-
Preserving Analysis of Vertically Partitioned Data Using Secure Matrix
Products”, Journal of Official Statistics: Volume 25(1), pages 125–138,
March 2009, Statistics Sweden Press, Available on-line at
http://www.jos.nu/Articles/abstract.asp?article=251125, last access May
2011.
KARR’09b A. F. Karr, (2009), “Secure Statistical Analysis of Distributed
Databases, Emphasizing What We Don't Know”, Journal of Privacy and
Confidentiality: Volume 1(2) article 5, pages 197-211, Department of
Statistics, Carnegie Mellon University Press, Available on-line at
http://repository.cmu.edu/jpc/vol1/iss2/5/, last access May 2011.
KILT’05 E. Kiltz, G. Leander and J. Malone-Lee, (2005), “Secure Computation of
the Mean and Related Statistics”, Theory of Cryptography, the
Proceedings of 2nd
Theory of Cryptography Conference (TCC 2005),
Lecture Notes in Computer Science: Volume 3378, pages 283-302,
Cambridge, Massachusetts, USA, 10-12 February 2005, Springer-Verlag
Berlin, Heidelberg.
KISS’06 L. Kissner, (2006), “Privacy-preserving Distributed Information
Sharing”, A PhD thesis, School of Computer Science, Carnegie Mellon
University.
KLEI’01 J. Kleinberg, C. H. Papadimitriou and P. Raghavan, (2001), “On the
Value of Private Information”, the Proceedings of the 8th
Conference
on Theoretical Aspects of Rationality and Knowledge (TARK’01), pages
249-257, Siena, Italy, 8-10 July 2001, Morgan Kaufmann Publishers Inc.
KUSH’97 E. Kushilevitz and R. Ostrovsky, (1997), “Replication Is Not Needed:
Single Database, Computationally-Private Information Retrieval”, the
Proceedings of the 38th
Annual Symposium on Foundations of Computer
Science, pages 364-373, Miami Beach, Florida , USA, 20-22 October
1997, IEEE Computer Society.
KUSH’00 E. Kushilevitz and R. Ostrovsky, (2000), “One-way Trapdoor
Permutations Are Sufficient for Non-trivial Single-server Private
Information Retrieval”, the Proceedings of the 19th
International
Conference on Theory and Application of Cryptographic Techniques
(EUROCRYPT’00), Lecture Notes in Computer Science: Volume 1807,
pages 104-121, Bruges, Belgium, 14-18 May 2000, Springer-Verlag
Berlin, Heidelberg.
LI’05 S. Li, Y. Dai and Q. You, (2005), “Secure Multi-party Computation
Solution to Yao's Millionaires' Problem Based on Set-inclusion”,
Progress in Natural Science (ISSN 1745-5391): Volume 15(9), pages 851-856,
National Natural Science Foundation of China.
LI’06 S. Li, Y. Dai, D. Wang and P. Luo, (2006), “Symmetric Encryption
Solutions to Millionaire's Problem and Its Extension”, the Proceedings
of the 2006 1st International Conference on Digital Information
Management, pages 531-537, Bangalore, India, 6-8 December 2006,
IEEE Computer Society Press.
LI’07 N. Li, T. Li and S. Venkatasubramanian, (2007), “t-Closeness: Privacy
Beyond k-Anonymity and ℓ-Diversity”, the Proceedings of the IEEE
23rd
International Conference on Data Engineering (ICDE 2007), pages
106-115, Istanbul, Turkey, 15-20 April 2007, IEEE Computer Society
Press.
LI’08a S. Li, D. Wang, Y. Dai and P. Luo, (2008), “Symmetric Cryptographic
Solution to Yao's Millionaires' Problem and An Evaluation of Secure
Multiparty Computations”, Information Sciences: An International
Journal: Volume 178(1), pages 244-255, Elsevier Science Inc Press.
LI’08b Y. Li and H. Lu, (2008), “Disclosure Analysis and Control in Statistical
Databases”, the Proceedings of the 13th
European Symposium on
Research in Computer Security (ESORICS’08), pages 146-160, Malaga,
Spain, 6-8 October 2008, Springer-Verlag Berlin, Heidelberg.
LI’09 S. Li, D. Wang and Y. Dai, (2009), “Symmetric Cryptographic Protocols
for Extended Millionaires’ Problem”, Science in China Series F:
Information Sciences: Volume 52(6), pages 974-982, Springer-Verlag
Berlin, Heidelberg.
LIN’05 X. Lin, C. Clifton and M. Zhu, (2005), “Privacy-preserving clustering
with distributed EM mixture modelling”, Journal of Knowledge and
Information Systems: Volume 8(1), pages 68-81, July 2005, Springer-
Verlag New York.
LIN’09 X. Lin and A. F. Karr, (2009), "Privacy-preserving Maximum Likelihood
Estimation for Distributed Data," Journal of Privacy and Confidentiality:
Volume 1(2) article 6, pages 213-222, Department of Statistics, Carnegie
Mellon University Press.
LINC’04 P. Lincoln, P. Porras and V. Shmatikov, (2004), “Privacy-preserving
Sharing and Correction of Security Alerts”, the Proceedings of the 13th
Conference on USENIX Security Symposium (SSYM’04): Volume 13,
pages 239-254, San Diego, California, USA, 9-13 August 2003,
USENIX Association Berkeley Press.
LIND’00 Y. Lindell and B. Pinkas, (2000), “Privacy-preserving Data Mining”, the
Proceedings of the 20th
Annual International Cryptology Conference on
Advances in Cryptology (CRYPTO2000), Lecture Notes on Computer
Science, Volume 1880, pages 36-53, Santa Barbara, California, USA,
19-23 August 2000, Springer-Verlag London.
LIND’02a Y. Lindell, (2002), “On the Composition of Secure Multi-Party
Protocols”, A PhD thesis, Department of Computer Science and Applied
Mathematics, the Weizmann Institute of Science.
LIND’02b Y. Lindell and B. Pinkas, (2002), “Privacy-preserving Data Mining”,
Journal of Cryptology, Volume 15(3), pages 177-206, Springer.
LIND’03 Y. Lindell, (2003), “General Composition and Universal Composability
in Secure Multi-Party Computation”, the Proceedings of the 44th
Annual
IEEE Symposium on Foundations of Computer Science (FOCS’03),
pages 394-403, Cambridge, Massachusetts, USA, 11-14 October 2003,
IEEE Computer Society Press.
LIND’07 Y. Lindell and B. Pinkas, (2007), “An Efficient Protocol for Secure Two-
Party Computation in the Presence of Malicious Adversaries”, the
Proceedings of the 26th
Annual International Conference on Advances in
Cryptology (EUROCRYPT’07), Lecture Notes in Computer Science:
Volume 4515, pages 52-78, Barcelona, Spain, 20-24 May 2007, Springer-
Verlag Berlin, Heidelberg.
LIND’08 Y. Lindell, B. Pinkas and N. P. Smart, (2008), “Implementing Two-party
Computation Efficiently with Security Against Malicious Adversaries”,
the Proceedings of the 6th
international Conference on Security and
Cryptography for Networks (SCN’08), Lecture Notes in Computer
Science: Volume 5229, pages 2-20, New York, New York, USA, 3-6
June 2008, Springer-Verlag Berlin, Heidelberg.
LIPM’05 H. Lipmaa, (2005), “An Oblivious Transfer Protocol with Log-Squared
Communication”, the Proceedings of the 8th
International Conference on Information Security
(ISC 2005), Lecture Notes in Computer Science: Volume 3650, pages
314-328, Singapore, 20-23 September 2005, Springer-Verlag Berlin,
Heidelberg.
LIPM’09 H. Lipmaa and B. Zhang, (2009), “Efficient Generalized Selective
Private Function Evaluation with Applications in Biometric
Authentication”, the Proceedings of the 5th
International Conference on
Information Security and Cryptology (Inscrypt’09), Lecture Notes in
Computer Science: Volume 6151, pages 154-163, Beijing, China, 12-15
December 2009, Springer-Verlag Berlin, Heidelberg.
LIU’10 M.C. Liu and N. Zhang, (2010), “A Solution to Privacy-preserving Two-
party Sign Test on Vertically Partitioned Data (P22NSTv) Using Data
Disguising Techniques”, the Proceedings of the International
Conference on Networking and Information Technology (ICNIT 2010),
pages 526-534, Manila, Philippines, 11-12 June 2010, IEEE Computer
Society Press.
LIU’11a M.C. Liu and N. Zhang, (2011), “A Cryptographic Solution to Privacy-
preserving Two-party Sign Test Computation on Vertically Partitioned
Data”, the Proceedings of the 2nd International Conference on
Electronics and Information Engineering (ICEIE2011), Volume 8, pages
8-16, Tianjin, China, 9-11 September 2011, Science and Technology
Press, Hong Kong.
LIU’11b M.C. Liu and N. Zhang, (2011), “A Cryptographic Solution to Privacy-
preserving Two-party Sign Test Computation on Vertically Partitioned
Data”, Advanced Material Research, Volume 403-408, Pages 1249 –
1257, Trans Tech Publications, Switzerland, doi:
10.4028/www.scientific.net/AMR.403-408.1249.
LUO’03 W. Luo and X. Li, (2003), “A Study of Secure Multi-Party Statistical
Analysis”, the Proceedings of the 2003 International Conference on
Computer Networks and Mobile Computing (ICCNMC’03), pages 377-
382, Shanghai, China, 20-23 October 2003, IEEE Computer Society
Press.
LUO’04 W. Luo and X. Li, (2004), “A Study of Secure Multi-party Elementary
Function Computation Protocols”, the Proceedings of the 3rd
International Conference on Information Security (InfoSecu’04), pages
5-12, Shanghai, China, 14-15 November 2004, ACM New York Press.
LUO’05 W. Luo and X. Li, (2005), “A Study of Secure Multi-Party Elementary
Function Computation Protocols”, Journal of Communication and
Computer, ISSN1548-7709, Volume 2(5), pages 32-40, David
Publishing Company, USA.
MALK’04 D. Malkhi, N. Nisan, B. Pinkas and Y. Sella, (2004), “Fairplay - A
Secure Two-Party Computation System”, the Proceedings of the 13th
Conference on USENIX Security Symposium (SSYM’04), Volume 13,
pages 287-302, San Diego, California, USA, 9-13 August 2004,
USENIX Association Berkeley.
MART’07 D. J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke and J. Y. Halpern,
(2007), “Worst-Case Background Knowledge for Privacy-Preserving
Data Publishing”, the Proceedings of the IEEE 23rd
International
Conference on Data Engineering (ICDE 2007), pages 126-135, Istanbul,
Turkey, 15-20 April 2007, IEEE Computer Society Press.
MATL’12 MathWorks support website, http://www.mathworks.co.uk/support/, last
access April 2012.
MENE’01 A. J. Menezes, P. C. van Oorschot and S. A. Vanstone, (2001),
“Handbook of Applied Cryptography, 5th Edition”, CRC Press, ISBN:
0-8493-8523-7, available online http://www.cacr.math.uwaterloo.ca/hac/, last
8523-7, available online http://www.cacr.math.uwaterloo.ca/hac/, last
access August 2011.
MICC’10 D. Micciancio, (2010), “Technical Perspective: A First Glimpse of
Cryptography's Holy Grail”, Communications of the ACM, Volume 53,
No 3, Page 96, ACM Press.
MUSE’08 J. Museux, M. Peeters and M. J. Santos, (2008), “Legal, Political and
Methodological Issues in Confidentiality in European Statistical System”,
the Proceedings of Privacy in Statistical Database (PSD2008), Lecture
Notes in Computer Science: Volume 5262, pages 324-334, Istanbul
Turkey, 24-26 September 2008, Springer-Verlag Berlin, Heidelberg.
NAOR’99 M. Naor, B. Pinkas and R. Sumner, (1999), “Privacy-preserving Auction
and Mechanism Design”, the Proceedings of the 1st ACM Conference on
Electronic Commerce (EC’99), pages 129-139, Denver, Colorado, USA,
3-5 November 1999, ACM New York Press.
OCR’03 U.S. Department of Health and Human Services Office for Civil Rights,
(2003), “Standards for Privacy of Individually Identifiable Health
Information Regulation Text; Security Standards for the protection of
Electronic protected Health Information; General Administrative
Requirements Including, Civil Money Penalties: Procedures for
Investigations, Imposition of Penalties, and Hearings”, OCR/HIPAA
Privacy/Security/Enforcement Regulation Text, 45 CFR Parts 160 and
164, Revised April 2003, Information Available Online:
http://aspe.hhs.gov/admnsimp/final/pvcguide1.htm,
Regulation Text (Unofficial Version, April 2003):
http://ahc.buffalo.edu/docs/Compliance-HIPAA-Privacy_Rule-
Security_Rule-Penalty_Information.pdf, last access May 2011.
OLIV’02 S. R. M. Oliveira and O. R. Zaiane, (2002), “Privacy-preserving Frequent
Itemset Mining”, the Proceedings of the IEEE ICDM Workshop on
Privacy, Security and Data Mining (PSDM 2002), CRPIT’14, pages 43-
54, Maebashi City, Japan, 9-12 December 2002, Australian Computer
Society Inc.
OLIV’04 S. R. M. Oliveira and O. R. Zaiane, (2004), “Achieving Privacy
Preservation when Sharing Data for Clustering”, the Proceedings of the
Secure Data Management, VLDB 2004 Workshop (SDM 2004), Lecture
Notes in Computer Science: Volume 3178, pages 67-82, Lake Buena
Vista, Florida, USA, 22-24 April 2004, Springer-Verlag, Berlin,
Heidelberg.
OSTR’07 R. Ostrovsky and W. E. Skeith, (2007), “A Survey of Single-Database
Private Information Retrieval: Techniques and Applications”, the
Proceedings of the 10th
International Conference on Practice and
Theory in Public-key Cryptography (PKC 2007), Lecture Notes in
Computer Science: Volume 4450/2007, pages 393-411, Beijing, China,
16-20 April 2007, Springer-Verlag Berlin, Heidelberg.
OSTR’08 R. Ostrovsky and W. E. Skeith, (2008), “Communication Complexity in
Algebraic Two-Party Protocols”, the Proceedings of the 28th
Annual
Conference on Cryptology: Advances in Cryptology (CRYPTO 2008),
Lecture Notes in Computer Science: Volume 5157, pages 379-396, Santa
Barbara, California, USA, 17-21 August 2008, Springer-Verlag Berlin,
Heidelberg.
PAIL’99a P. Paillier, (1999), “Public-Key Cryptosystems Based on Composite
Degree Residuosity Classes”, the Proceedings of the 17th
International
Conference on Theory and Application of Cryptographic Techniques
(EUROCRYPT’99), Lecture Notes in Computer Science: Volume 1592,
pages 223-238, Prague, Czech Republic, 2-6 May 1999, Springer-Verlag
Berlin Heidelberg.
PAIL’99b P. Paillier and D. Pointcheval, (1999), “Efficient Public-Key
Cryptosystems Provably Secure Against Active Adversaries”, the
Proceedings of the International Conference on the Theory and
Applications of Cryptology and Information Security, Advances in
Cryptology - ASIACRYPT '99, Lecture Notes in Computer Science
Volume: 1716, pages 165-179, Singapore, 14-18 November 1999,
Springer 1999, ISBN 3-540-66666-4.
PARA’06 R. Parameswaran, (2006), “A Robust Data Obfuscation Approach for
Privacy-preserving Collaborative Filtering”, A PhD thesis, School of
Electrical and Computer Engineering, Georgia Institute of Technology.
PARE’07 J. J. Parekh, (2007), “Privacy-preserving Distributed Event
Corroboration”, A PhD thesis, the Graduate School of Art and Science,
Columbia University.
PINK’03 B. Pinkas, (2003), “Fair Secure Two-Party Computation”, the
Proceedings of the 22nd
International Conference on Theory and
Applications of Cryptographic Techniques (EUROCRYPT’03), Lecture
Notes in Computer Science: Volume 2656, pages 87-105, Warsaw,
Poland, 4-8 May 2003, Springer-Verlag Berlin, Heidelberg.
RABI’81 M. O. Rabin, (1981), “How to Exchange Secrets with Oblivious
Transfer”, Technical Report TR-81, Aiken Computation Laboratory,
Harvard University, 22 pages, available online:
http://eprint.iacr.org/2005/187, last access May 2011.
RABI’89 T. Rabin and M. Ben-Or, (1989), “Verifiable Secret Sharing and Multiparty
Protocols with Honest Majority”, the Proceedings of the 21st Annual
ACM Symposium on Theory of Computing (STOC’89), pages 73-85,
Seattle, Washington, USA, 14-17 May 1989, ACM New York Press.
RAPP’04 D. K. Rapp, (2004), “Homomorphic Cryptosystems and Their
Applications”, A PhD thesis, Department of Mathematics, University of
Dortmund.
REIT’04 J. P. Reiter, C. N. Kohnen, A. F. Karr, X. Lin and A. P. Sanil, (2004),
“Secure Regression for Vertically Partitioned, Partially Overlapping
Data”, Digital Government II: Technical Reports, National Institute of
Statistical Sciences, USA, 7 pages, available online:
http://nisla05.niss.org/dgii/TR/secureEM2.pdf, last access May 2011.
RIZV’02 S. Rizvi and J. R. Haritsa, (2002), “Maintaining Data Privacy in
Association Rule Mining”, the Proceedings of the 28th
International
Conference on Very Large Data Bases (VLDB’02), pages 682-693, Hong
Kong, China, 20-23 August 2002, VLDB Endowment Press.
ROSE’97 W. Rosenkrantz, (1997), “Introduction to Probability and Statistics for
Scientists and Engineers”, New York; London: McGraw-Hill, 1997,
ISBN: 007053988X, 9780070539884.
ROUG’06 M. Roughan and Y. Zhang, (2006), “Secure Distributed Data-Mining and
Its Application to Large-Scale Network Measurements”, ACM
SIGCOMM Computer Communication Review: Volume 36(1), pages 7-
14, ACM New York Press.
SALA’06 J. Salazar-Gonzalez, (2006), “Statistical Confidentiality: Optimization
Techniques to Protect Tables”, Computers and Operations Research:
Volume 35, pages 1638-1651, Elsevier Ltd. Publisher.
SAMA’98 P. Samarati and L. Sweeney, (1998), “Protecting Privacy when
Disclosing Information: k-Anonymity and its Enforcement through
Generalization and Suppression”, Technical Report SRI-CSL-98-04,
Computer Science Laboratory, SRI International, 19 pages, available
online: http://www.csl.sri.com/papers/sritr-98-04/, last access May 2011.
SANI’04 A. P. Sanil, A. F. Karr, J. P. Reiter, X. Lin, (2004), “Privacy-preserving
regression modelling via distributed computation”, the Proceedings of
the 10th
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD’04), pages 677-682, Seattle,
Washington, USA, 22-25 August 2004, ACM New York Press.
SHAM’79 A. Shamir, (1979), “How to Share a Secret”, Communications of the
ACM, Volume 22(11), pages 612-613, November 1979, ACM New York
Press.
SHAM’80 A. Shamir, (1980), “On the Power of Commutativity in Cryptography”,
the Proceedings of the 7th
Colloquium on Automata, Languages and
Programming, Lecture Notes in Computer Science: Volume 85, pages
582-595, Noordwijkerhout, The Netherlands, 14-18 July 1980, Springer-
Verlag Berlin, Heidelberg.
SHAN’48 C. E. Shannon, (1948), “A Mathematical Theory of Communication”,
Bell System Technical Journal, Volume 27, Pages 379–423, 623–656,
July and October 1948, Available online http://cm.bell-
labs.com/cm/ms/what/shannonday/paper.html, last access Aug 2011.
SHEN’07 C. Shen, J. Zhan, D. Wang, T. Hsu and C. Liau, (2007), “Information-
Theoretically Secure Number-Product Protocol”, the Proceedings of the
6th
International Conference on Machine Learning and Cybernetics,
pages 3006-3011, Hong Kong, China, 19-22 Aug 2007, IEEE Computer
Society Press.
SHLO’07 N. Shlomo, (2007), “Statistical Disclosure Control Methods for Census
Frequency Tables”, International Statistical Review, Volume 75(2),
pages 199–217, August 2007, Blackwell Publishing Ltd, USA.
SHOR’07 T. S. Shores, (2007), “Applied Linear Algebra and Matrix Analysis”,
August 14 2007, Springer Science+Business Media, LLC, ISBN 978-0-
387-33195-9.
STIN’03 D. R. Stinson, (2003), “Combinatorial Designs-Constructions and
Analysis”, October 2003, Springer-Verlag New York, ISBN 0-387-
95487-2.
SPRE’00 P. Sprent and N. C. Smeeton, (2000), “Applied Nonparametric Statistical
Methods, Third Edition”, Chapman & Hall/CRC Texts in Statistical
Science, September 2000, ISBN-13: 978-1584881452.
SUBR’04 H. Subramaniam, R. N. Wright and Z. Yang, (2004), “Experimental
Analysis of Privacy-Preserving Statistics Computation”, the Proceedings
of the VLDB 2004 Workshop (SDM 2004), Lecture Notes in Computer
Science: Volume 3178, pages 325-333, Toronto, Canada, 30 August
2004, Springer-Verlag Berlin Heidelberg.
SWEE’02a L. Sweeney, (2002), “k-Anonymity: A Model for Protecting Privacy”,
International Journal on Uncertainty, Fuzziness and Knowledge-based
Systems, Volume 10 (5), pages 557-570, World Scientific Publishing Co.,
Inc. River Edge, NJ, USA.
SWEE’02b L. Sweeney, (2002), “Achieving K-Anonymity Privacy Protection Using
Generalization and Suppression”, International Journal of Uncertainty,
Fuzziness and Knowledge-Based Systems, Volume 10(5), pages 571-
588, World Scientific Publishing Co., Inc. River Edge, NJ, USA.
TRAU’07 J. F. Traub, Y. Yemini and H. Wozniakowski, (1984), “The Statistical
Security of a Statistical Database”, ACM transactions on Database
Systems (TODS), Volume 9(4), pages 672-679, ACM New York Press.
VAID’02 J. Vaidya and C. Clifton, (2002), “Privacy-preserving Association Rule
Mining in Vertically Partitioned Data”, the Proceedings of the 8th
ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD’02), pages 639-644, Edmonton, Alberta, Canada, 23-26
July 2002, ACM New York Press.
VAID’04 J. S. Vaidya, (2004), “Privacy-preserving Data Mining over Vertically
Partitioned Data”, A PhD thesis, Computer Science, Purdue University.
VAID’06 J. Vaidya, C. Clifton and M. Zhu, (2006), “Privacy-preserving data
Mining”, Advances in Information Security, Volume 19, ISBN-10:
0387258868, Date of Publication: 4 October 2007, Springer
Science+Business Media, Inc.
VERY’04 V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin and Y.
Theodoridis, (2004), “State-of-the-art in Privacy-preserving Data
Mining”, ACM SIGMOD Record: Volume 33(1), pages 50-57, ACM
New York Press.
VOUL’09 A. S. Voulodimos and C. Z. Patrikakis, (2009), “Quantifying Privacy in
terms of Entropy for Context Aware Services”, Identity in the
Information Society: Volume 2(2), pages 155-169, Springer Netherland.
WANG’06 D. Wang, C. Liau, Y. Chiang and T. Hsu, (2006), “Information
Theoretical Analysis of Two-Party Secret Computation”, the
Proceedings of the 20th
Annual IFIP WG 11.3 Working Conference on
Data and Application Security, Lecture Notes in Computer Science:
Volume 4127, pages 310-317, Sophia Antipolis, France, 31 July-2
August 2006, Springer-Verlag Berlin, Heidelberg.
WANG’09 I. Wang, C. Shen, J. Zhan, T. Hsu, C. Liau and D. Wang, (2009),
“Towards Empirical Aspects of Secure Scalar Product”, IEEE
Transactions on Systems, Man, and Cybernetics, Part C: Applications
and Reviews - Special issue on information reuse and integration,
Volume 39(4), pages 440-447, July 2009, IEEE Computer Society Press.
WILL’96 L. Willenborg and T. D. Waal, (1996), “Statistical Disclosure Control in
Practice”, Lecture Notes in Statistics: Volume 111, ISBN 978-0-387-
94722-8, Year of Publication: 1996, Springer-Verlag New York, Inc.
WILL’01 L. Willenborg and T. D. Waal, (2001), “Elements of Statistical
Disclosure Control”, Lecture Notes in Statistics: Volume 155, ISBN
978-0-387-95121-8 , Year of Publication 2001, Springer-Verlag New
York, Inc.
WINK’05 W. E. Winkler, (2005), “Re-identification Methods for Evaluating the
Confidentiality of Analytically Valid Microdata”, Research Report
Series: Statistics #2005-09, Statistical research Division, U.S. Bureau
Census, Washington D.C., pages 50-69, Available online:
http://www.census.gov/srd/papers/pdf/rrs2005-09.pdf, last access May
2011.
WU’06 M. Wu and X. Ye, (2006), “Towards the Diversity of Sensitive Attributes
in k-Anonymity”, the Proceedings of the 2006 IEEE/WIC/ACM
International Conference on Web Intelligence and Intelligent Agent
Technology (WI-IAT 2006 Workshops/WI-IATW’06), pages 98-104,
Hong Kong, China, 18-22 December 2006, IEEE Computer Society Press.
WU’05 X. Wu, Y. Wang and Y. Zheng, (2005), “Statistical Database Modelling
for Privacy-preserving Database Generation”, the Proceedings of the 15th
International Symposium on Methodologies for Intelligent Systems
(ISMIS 2005), Lecture Notes in Computer Science: Volume 3488, pages
382-390, Saratoga Springs, New York, USA, 25-28 May 2005, Springer-
Verlag, Berlin, Heidelberg.
WU’07 X. Wu, C. Chu, Y. Wang, F. Liu and D. Yue, (2007), “Privacy-preserving
Data Mining Research: Current Status and Key Issues”, the Proceedings
of the 7th
International Conference on Computational Science: Part III,
Lecture Notes in Computer Science: Volume 4489, pages 762-772,
Beijing, China, 27-30 May 2007, Springer-Verlag, Berlin, Heidelberg.
XIAO’06 X. Xiao and Y. Tao, (2006), “Personalized Privacy Preservation”, the
Proceedings of the 2006 ACM SIGMOD International Conference on
Management of Data (SIGMOD’06), pages 229-240, Chicago, Illinois,
USA, 26-29 June 2006, ACM New York Press.
YAO’82 A. C. Yao, (1982), “Protocols for secure computations”, the Proceedings
of the 23rd
Annual IEEE Symposium on Foundations of Computer
Science, pages 160-164, Chicago, Illinois, USA, 3-5 November 1982,
IEEE Computer Society Press.
YAO’86 A. C. Yao, (1986), “How to generate and exchange secrets”, the
Proceedings of the 27th
IEEE Symposium on Foundations of Computer
Science, pages 162-167, Portland, Oregon, USA, 21-23 October 1985,
IEEE Computer Society Press.
YAMP’06 A. Yampolskiy, (2006), “Efficient Cryptographic Tools for Secure
Distributed Computing”, A PhD thesis, Faculty of Graduate School,
Yale University.
ZHAO’05 W. Zhao and V. Varadharajan, (2005), “Efficient TTP-free Mental Poker
Protocols”, the Proceedings of International Conference on Information
Technology: Coding and Computing (ITCC 2005), Volume 1, pages
745-750, Las Vegas, Nevada, USA, 4-6 April 2005, IEEE Computer
Society Press.
Appendix………
A. Definitions of Privacy
A.1 Privacy with respect to Semi-honest Behaviour [GOLD’04]
Let $f : \{0,1\}^* \times \{0,1\}^* \rightarrow \{0,1\}^* \times \{0,1\}^*$ be a
probabilistic, polynomial-time functionality, where $f_1(x,y)$ (respectively,
$f_2(x,y)$) denotes the first (respectively, second) element of $f(x,y)$. Let
$\Pi$ be a two-party protocol for computing $f$. The view of the first
(respectively, second) party during an execution of $\Pi$ on $(x,y)$, denoted
$\mathrm{VIEW}_1^{\Pi}(x,y)$ (respectively, $\mathrm{VIEW}_2^{\Pi}(x,y)$), is
$(x, r^1, m_1^1, \ldots, m_t^1)$ (respectively, $(y, r^2, m_1^2, \ldots, m_t^2)$),
where $r^1$ (respectively, $r^2$) represents the outcome of the first
(respectively, second) party's internal coin tosses, and $m_i$ represents the
$i$-th message it has received. The output of the first (respectively, second)
party after an execution of $\Pi$ on $(x,y)$, denoted
$\mathrm{OUTPUT}_1^{\Pi}(x,y)$ (respectively, $\mathrm{OUTPUT}_2^{\Pi}(x,y)$),
is implicit in the party's view of the execution, and
$\mathrm{OUTPUT}^{\Pi}(x,y) = (\mathrm{OUTPUT}_1^{\Pi}(x,y),
\mathrm{OUTPUT}_2^{\Pi}(x,y))$.

(deterministic case) For a deterministic functionality $f$, we say that $\Pi$
privately computes $f$ if there exist probabilistic polynomial-time algorithms,
denoted $S_1$ and $S_2$, such that

$\{S_1(x, f_1(x,y))\}_{x,y \in \{0,1\}^*} \stackrel{C}{\equiv}
\{\mathrm{VIEW}_1^{\Pi}(x,y)\}_{x,y \in \{0,1\}^*}$, and

$\{S_2(y, f_2(x,y))\}_{x,y \in \{0,1\}^*} \stackrel{C}{\equiv}
\{\mathrm{VIEW}_2^{\Pi}(x,y)\}_{x,y \in \{0,1\}^*}$,

where $|x| = |y|$. (Recall that $\stackrel{C}{\equiv}$ denotes computational
indistinguishability by (non-uniform) families of polynomial-size circuits.)

(general case) We say that $\Pi$ privately computes $f$ if there exist
probabilistic polynomial-time algorithms, denoted $S_1$ and $S_2$, such that

$\{(S_1(x, f_1(x,y)), f(x,y))\}_{x,y} \stackrel{C}{\equiv}
\{(\mathrm{VIEW}_1^{\Pi}(x,y), \mathrm{OUTPUT}^{\Pi}(x,y))\}_{x,y}$, and

$\{(S_2(y, f_2(x,y)), f(x,y))\}_{x,y} \stackrel{C}{\equiv}
\{(\mathrm{VIEW}_2^{\Pi}(x,y), \mathrm{OUTPUT}^{\Pi}(x,y))\}_{x,y}$.

We stress that $\mathrm{VIEW}_1^{\Pi}(x,y)$, $\mathrm{VIEW}_2^{\Pi}(x,y)$,
$\mathrm{OUTPUT}_1^{\Pi}(x,y)$ and $\mathrm{OUTPUT}_2^{\Pi}(x,y)$ are related
random variables, defined as a function of the same random execution. In
particular, $\mathrm{OUTPUT}_i^{\Pi}(x,y)$ is fully determined by
$\mathrm{VIEW}_i^{\Pi}(x,y)$.
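To make the simulation requirement concrete, consider a toy exchange (a hypothetical illustration, not one of the thesis protocols): Alice masks her input $x$ with a uniformly random $r \in \mathbb{Z}_n$ and sends $a = (x + r) \bmod n$ to Bob. Bob's view consists of his own input and the single message $a$; since $a$ is uniform on $\mathbb{Z}_n$ regardless of $x$, a simulator that samples a uniform element of $\mathbb{Z}_n$, without ever seeing $x$, reproduces the view exactly. The Python sketch below (function names are illustrative) checks this distributional claim by enumerating Alice's coin tosses:

```python
import random

def real_message(x, r, n):
    """Message Bob receives in the real execution: Alice's input masked by r."""
    return (x + r) % n

def simulated_message(n, rng=random):
    """Simulator's message: a uniform element of Z_n, produced without x."""
    return rng.randrange(n)

def real_distribution(x, n):
    """Multiset of Bob's received messages over all coin tosses r of Alice."""
    return sorted(real_message(x, r, n) for r in range(n))

# For every input x, the real message takes each value of Z_n exactly once,
# i.e. it is uniform and matches the simulator's output distribution exactly,
# so Bob's view of this exchange reveals nothing about x.
n = 17
for x in (0, 5, 13):
    assert real_distribution(x, n) == list(range(n))
```

Here the real and simulated views are identically distributed, which is stronger than the computational indistinguishability the definition asks for.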
A.2 Security in the Semi-honest Model [GOLD’04]
Let $f : \{0,1\}^* \times \{0,1\}^* \rightarrow \{0,1\}^* \times \{0,1\}^*$ be a
functionality, where $f_1(x,y)$ (respectively, $f_2(x,y)$) denotes the first
(respectively, second) element of $f(x,y)$, and let $\Pi$ be a two-party
protocol for computing $f$.

Let $\bar{B} = (B_1, B_2)$ be a pair of probabilistic polynomial-time algorithms
representing parties' strategies for the ideal model. Such a pair is admissible
(in the ideal model) if for at least one $B_i$ we have $B_i(u,v,z) = v$, where
$u$ denotes the party's local input, $v$ its local output, and $z$ its auxiliary
input. The joint execution of $f$ under $\bar{B}$ in the ideal model on input
pair $(x,y)$ and auxiliary input $z$, denoted
$\mathrm{IDEAL}_{f,\bar{B}(z)}(x,y)$, is defined as

$(f(x,y),\ B_1(x, f_1(x,y), z),\ B_2(y, f_2(x,y), z))$.

(That is, if $B_i$ is honest, then it just outputs the value $f_i(x,y)$ obtained
from the trusted party, which is implicit in this definition. Thus, our peculiar
choice to feed both parties with the same auxiliary input is immaterial, because
the honest party ignores its auxiliary input.)

Let $\bar{A} = (A_1, A_2)$ be a pair of probabilistic polynomial-time algorithms
representing parties' strategies for the real model. Such a pair is admissible
(in the real model) if for at least one $i \in \{1,2\}$ we have
$A_i(\mathrm{view}, \mathrm{aux}) = \mathrm{out}$ for every $\mathrm{view}$ and
$\mathrm{aux}$, where $\mathrm{out}$ is the output implicit in $\mathrm{view}$.
The joint execution of $\Pi$ under $\bar{A}$ in the real model on input pair
$(x,y)$ and auxiliary input $z$, denoted $\mathrm{REAL}_{\Pi,\bar{A}(z)}(x,y)$,
is defined as

$(\mathrm{OUTPUT}^{\Pi}(x,y),\ A_1(\mathrm{VIEW}_1^{\Pi}(x,y), z),\
A_2(\mathrm{VIEW}_2^{\Pi}(x,y), z))$,

where $\mathrm{OUTPUT}^{\Pi}(x,y)$ and the $\mathrm{VIEW}_i^{\Pi}(x,y)$'s refer
to the same execution and are defined as in Definition 2.1a.

(Again, if $A_i$ is honest, then it just outputs the value $f_i(x,y)$ obtained
from the execution of $\Pi$, and we may feed both parties with the same
auxiliary input.)

Protocol $\Pi$ is said to securely compute $f$ in the semi-honest model (secure
with respect to $f$ and semi-honest behaviour) if for every probabilistic
polynomial-time pair of algorithms $\bar{A} = (A_1, A_2)$ that is admissible for
the real model, there exists a probabilistic polynomial-time pair of algorithms
$\bar{B} = (B_1, B_2)$ that is admissible for the ideal model such that

$\{\mathrm{IDEAL}_{f,\bar{B}(z)}(x,y)\}_{x,y,z} \stackrel{C}{\equiv}
\{\mathrm{REAL}_{\Pi,\bar{A}(z)}(x,y)\}_{x,y,z}$,

where $x, y, z \in \{0,1\}^*$ such that $|x| = |y|$ and
$|z| = \mathrm{poly}(|x|)$.
A.3a Definition of the Malicious Adversaries in the Ideal Model [GOLD’04]
Let $f : \{0,1\}^* \times \{0,1\}^* \rightarrow \{0,1\}^* \times \{0,1\}^*$ be a
probabilistic, polynomial-time functionality, where $f_1(x,y)$ (respectively,
$f_2(x,y)$) denotes the first (respectively, second) element of $f(x,y)$. Let
$\Pi$ be a two-party protocol for computing $f$. Let $\bar{B} = (B_1, B_2)$ be a
pair of probabilistic polynomial-time algorithms representing strategies in the
ideal model. Such a pair is admissible (in the ideal malicious model) if for at
least one $i \in \{1,2\}$, called honest, we have $B_i(u,z,r) = u$ and
$B_i(u,z,r,v) = v$, for every possible value of $u$, $z$, $r$ and $v$.
Furthermore, $|B_i(u,z,r)| = |u|$ must hold for both $i$'s. The joint execution
of $f$ under $\bar{B}$ in the ideal model (on input pair $(x,y)$ and auxiliary
input $z$), denoted $\mathrm{IDEAL}_{f,\bar{B}(z)}(x,y)$, is defined by
uniformly selecting a random-tape $r$ for the adversary, and letting
$\mathrm{IDEAL}_{f,\bar{B}(z)}(x,y) \stackrel{\mathrm{def}}{=} \gamma(x,y,z,r)$,
where $\gamma(x,y,z,r)$ is defined as follows:

In case Party 1 is honest, $\gamma(x,y,z,r)$ equals
$(f_1(x,y'),\ B_2(y,z,r,f_2(x,y')))$, where
$y' \stackrel{\mathrm{def}}{=} B_2(y,z,r)$.

In case Party 2 is honest, $\gamma(x,y,z,r)$ equals
$(B_1(x,z,r,f_1(x',y)),\ \bot)$ if $B_1(x,z,r,f_1(x',y)) = \bot$, and
$(B_1(x,z,r,f_1(x',y)),\ f_2(x',y))$ otherwise, where, in both cases,
$x' \stackrel{\mathrm{def}}{=} B_1(x,z,r)$.
A.3b Definition of the Malicious Adversaries in the Real Model [GOLD’04]
Let $f : \{0,1\}^* \times \{0,1\}^* \rightarrow \{0,1\}^* \times \{0,1\}^*$ be a
probabilistic, polynomial-time functionality, where $f_1(x,y)$ (respectively,
$f_2(x,y)$) denotes the first (respectively, second) element of $f(x,y)$. Let
$\Pi$ be a two-party protocol for computing $f$. Let $\bar{A} = (A_1, A_2)$ be a
pair of probabilistic polynomial-time algorithms representing strategies in the
real model. Such a pair is admissible (with respect to $\Pi$) (for the real
malicious model) if at least one $A_i$ coincides with the strategy specified by
$\Pi$. (In particular, this $A_i$ ignores the auxiliary input.) The joint
execution of $\Pi$ under $\bar{A}$ in the real model (on input pair $(x,y)$ and
auxiliary input $z$), denoted $\mathrm{REAL}_{\Pi,\bar{A}(z)}(x,y)$, is defined
as the output pair resulting from the interaction between $A_1(x,z)$ and
$A_2(y,z)$. (Recall that the honest $A_i$ ignores the auxiliary input $z$, and
so our peculiar choice of providing both $A_i$'s with the same $z$ is
immaterial.)
A.3c Security in the Malicious Model [GOLD’04]
Let $f : \{0,1\}^* \times \{0,1\}^* \rightarrow \{0,1\}^* \times \{0,1\}^*$ be a
probabilistic, polynomial-time functionality, where $f_1(x,y)$ (respectively,
$f_2(x,y)$) denotes the first (respectively, second) element of $f(x,y)$. Let
$\Pi$ be a two-party protocol for computing $f$.

Protocol $\Pi$ is said to securely compute $f$ (in the malicious model) if for
every probabilistic polynomial-time pair of algorithms $\bar{A} = (A_1, A_2)$
that is admissible for the real model (of Definition 2.2b), there exists a
probabilistic polynomial-time pair of algorithms $\bar{B} = (B_1, B_2)$ that is
admissible for the ideal model (of Definition 2.2a) such that

$\{\mathrm{IDEAL}_{f,\bar{B}(z)}(x,y)\}_{x,y,z} \stackrel{C}{\equiv}
\{\mathrm{REAL}_{\Pi,\bar{A}(z)}(x,y)\}_{x,y,z}$,

where $x, y, z \in \{0,1\}^*$ such that $|x| = |y|$ and
$|z| = \mathrm{poly}(|x|)$.

(Recall that $\stackrel{C}{\equiv}$ denotes computational indistinguishability
by (non-uniform) families of polynomial-size circuits.) When the context is
clear, we sometimes refer to $\Pi$ as a secure implementation of $f$.
A.4 The Security Definition w.r.t. Covert Adversaries [AUMA'07]

Let ε, 0 ≤ ε ≤ 1, be a value called the deterrence factor. Any attempt to cheat
by an adversary is detected by the honest parties with probability at least ε.
Thus, provided that ε is sufficiently large, an adversary that wishes not to be
caught cheating will refrain from attempting to cheat, lest it be caught doing
so. Clearly, the higher the value of ε, the greater the probability that the
adversary is caught and thus the greater the deterrent to cheat. We therefore
call our notion security in the presence of covert adversaries with
ε-deterrent. Note that the security guarantee does not preclude successful
cheating.
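The deterrence guarantee can be illustrated numerically: if each cheating attempt is detected independently with probability at least ε, an adversary that cheats k times escapes detection with probability at most (1 − ε)^k. A minimal Python sketch (the values of ε and k below are illustrative, not taken from any protocol in this thesis):

```python
def escape_probability(epsilon: float, k: int) -> float:
    """Probability that k independent cheating attempts all go undetected,
    when each attempt is caught with probability at least epsilon."""
    return (1.0 - epsilon) ** k

# With a 1/2-deterrent, ten cheating attempts survive undetected
# with probability 2^-10, i.e. slightly less than 0.1%.
print(escape_probability(0.5, 10))  # 0.0009765625
```

So even a modest deterrence factor compounds quickly against repeated cheating, which is the intuition behind the ε-deterrent notion.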
B. Protocol Prototypes
B.1 MATLAB.P22NSTP
% Alice.RpdfGP
% function [Dset_aug choose_Of_Random_functionX Dset_min_1 Dset_max_1 Dset_sig_1 Dset_avg_1] = step_1_alice_rpdfgp()
function [size_of_Dset_aug] = step_1_alice_rpdfgp()
clear
clc

%-- start: random number generation function for the joint party --
X = load('x.dat');   % load the data set x.dat
Dset = X;
Dset_max = max(Dset);
Dset_min = min(Dset);
Dset_avg = mean(Dset);
Dset_var = var(Dset);

% generate three random values for disguising mean & variance & max & min
for i = 1 : 3
    K_rand(i) = 1 + round(10.*rand(1,1));   % range from 1 to 10
end
Dset_avg_1 = (K_rand(1)*Dset_avg).*rand(1,1);
Dset_var_1 = (K_rand(2)*Dset_var).*rand(1,1);
Dset_sig_1 = sqrt(Dset_var_1);
Dset_min_1 = Dset_min.*rand(1,1);
Dset_max_1 = Dset_min + (K_rand(3)*Dset_max).*rand(1,1);   % make sure Dset_max_1 is larger than Dset_min_1

for i = 4 : 5
    K_rand(i) = 1 + round(10.*rand(1,1));   % range from 1 to 10; two further multipliers to widen the range of random numbers
end
size_of_X = length(X);

% use the uniform distribution to generate random values
Generate_random_X_using_uniform = abs(round(Dset_min_1 + Dset_max_1.*rand(1,K_rand(4)*size_of_X)));

% use the normal distribution to generate random values
Generate_random_X_using_normal = abs(round(Dset_avg_1 + Dset_sig_1.*randn(1,K_rand(4)*size_of_X)));
%-- end: random number generation function for Alice --

% choose a random function with probability 0.5
choose_Of_Random_functionX = round(rand(1,1));
if choose_Of_Random_functionX == 1
    Dset_aug = [Dset Generate_random_X_using_uniform];
else
    Dset_aug = [Dset Generate_random_X_using_normal];
end

size_of_Dset_aug = size(Dset_aug,2);

save Dset Dset;
save Dset_aug Dset_aug;
save Dset_avg_1 Dset_avg_1;
save Dset_var_1 Dset_var_1;
save Dset_sig_1 Dset_sig_1;
save Dset_min_1 Dset_min_1;
save Dset_max_1 Dset_max_1;
save choose_Of_Random_functionX choose_Of_Random_functionX;

% row_size_of_Dset_aug_step_1_alice_rpdfgp = size(Dset_aug,1);
% column_size_of_Dset_aug_step_1_alice_rpdfgp = size(Dset_aug,2);
% size_of_Dset_step_1_alice_rpdfgp = size(Dset,2);
% Alice.DOP
% function [Task_1_Result permutation_Matrix_1] = step_2_alice_dop(Dset_aug, choose_Of_Random_functionX, Dset_min_1, Dset_max_1, Dset_sig_1, Dset_avg_1)
function [rowsize_of_Task_1_Result, colsize_of_Task_1_Result] = step_2_alice_dop()
clear
clc

load('Dset.mat');
load('Dset_aug.mat');
load('choose_Of_Random_functionX.mat');
load('Dset_min_1');
load('Dset_max_1');
load('Dset_sig_1');
load('Dset_avg_1');

% first-stage permutation: generate permutation matrix 1
Length_of_Dset_aug = length(Dset_aug);
% permutation matrix 1 is used to swap the order of xi and xi'; it only
% makes this computation smooth and contributes no security protection,
% because it has to be sent to Bob at a later stage.
permutation_Matrix_1 = eye(Length_of_Dset_aug,Length_of_Dset_aug);
for i = 1 : Length_of_Dset_aug
    index = round(1 + (Length_of_Dset_aug-1).*rand(2,1));
    temp1 = permutation_Matrix_1(:,index(1));
    permutation_Matrix_1(:,index(1)) = permutation_Matrix_1(:,index(2));
    permutation_Matrix_1(:,index(2)) = temp1;
end
% in this step an identity matrix is generated first and its columns are
% then swapped; need to think whether this method is OK -- does it come
% with any weakness?

Dset_aug_permuted_by_permutation_Matrix_1 = Dset_aug * permutation_Matrix_1;

Permuted_diagonal_Matrix = zeros(Length_of_Dset_aug,Length_of_Dset_aug);
for i = 1 : Length_of_Dset_aug
    Permuted_diagonal_Matrix(i,i) = Dset_aug_permuted_by_permutation_Matrix_1(i);
end

% generate random matrix M, drawn from either the normal or the uniform
% distribution according to the previous computation result
rand_number_Matrix = zeros(Length_of_Dset_aug,Length_of_Dset_aug);
if choose_Of_Random_functionX == 1
    for i = 1 : Length_of_Dset_aug
        for j = 1 : Length_of_Dset_aug
            rand_number_Matrix(i,j) = abs(round(Dset_min_1 + Dset_max_1.*rand(1,1)));
        end
    end
else
    for i = 1 : Length_of_Dset_aug
        for j = 1 : Length_of_Dset_aug
            rand_number_Matrix(i,j) = abs(round(Dset_avg_1 + Dset_sig_1.*randn(1,1)));
        end
    end
end
% set the diagonal entries to 0
for i = 1 : Length_of_Dset_aug
    rand_number_Matrix(i,i) = 0;
end

Step_1_randomized_matrix = rand_number_Matrix + Permuted_diagonal_Matrix;

% permutation matrix 2 is used to swap the row order of xi'; it is used
% later to generate the table that records the swap order of this
% permutation.
permutation_Matrix_2 = eye(Length_of_Dset_aug,Length_of_Dset_aug);
for i = 1 : Length_of_Dset_aug
    index = round(1 + (Length_of_Dset_aug-1).*rand(2,1));
    temp2 = permutation_Matrix_2(:,index(1));
    permutation_Matrix_2(:,index(1)) = permutation_Matrix_2(:,index(2));
    permutation_Matrix_2(:,index(2)) = temp2;
end

Task_1_Result = permutation_Matrix_2 * Step_1_randomized_matrix;

% record the order changes made by permutation matrix 1
for i = 1 : length(permutation_Matrix_1)
    row_swap_order_by_pM1(i) = find(permutation_Matrix_1(i,:));      % the row to which xi is swapped by T1
    column_swap_order_by_pM1(i) = find(permutation_Matrix_1(:,i));   % the column to which xi is swapped by T1
end

% record the order changes made by permutation matrix 2
for i = 1 : length(permutation_Matrix_2)
    row_swap_order_by_pM2(i) = find(permutation_Matrix_2(i,:));      % the row to which xi' is swapped by T2
    column_swap_order_by_pM2(i) = find(permutation_Matrix_2(:,i));   % the column to which xi' is swapped by T2
end

rowsize_of_Task_1_Result = size(Task_1_Result,1);
colsize_of_Task_1_Result = size(Task_1_Result,2);

save Task_1_Result Task_1_Result;
save permutation_Matrix_1 permutation_Matrix_1;
save permutation_Matrix_2 permutation_Matrix_2;
save row_swap_order_by_pM1 row_swap_order_by_pM1;
save column_swap_order_by_pM1 column_swap_order_by_pM1;
save row_swap_order_by_pM2 row_swap_order_by_pM2;
save column_swap_order_by_pM2 column_swap_order_by_pM2;
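The listings build their permutation matrices by starting from an identity matrix and performing n random column swaps; the inline comment itself asks whether this method has a weakness. One point worth noting is that n random transpositions need not produce a uniformly distributed permutation. A standard alternative is to draw a uniformly random permutation directly (Fisher–Yates, as MATLAB's randperm does) and build the 0/1 matrix from it. A Python sketch of this alternative (illustrative only, not part of the thesis code):

```python
import random

def random_permutation_matrix(n):
    """Build an n-by-n 0/1 permutation matrix from a uniformly random
    permutation (Fisher-Yates via random.shuffle)."""
    perm = list(range(n))
    random.shuffle(perm)
    # Column j has its single 1 in row perm[j], so right-multiplying a
    # row vector maps entry perm[j] to position j.
    return [[1 if perm[j] == i else 0 for j in range(n)] for i in range(n)]

def apply_to_row_vector(v, m):
    """Row-vector times matrix, mirroring Dset_aug * permutation_Matrix_1."""
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) for j in range(n)]

data = [3, 1, 4, 1, 5]
pm = random_permutation_matrix(len(data))
shuffled = apply_to_row_vector(data, pm)   # a permutation of data
```

Every row and column of the resulting matrix sums to exactly 1, and every one of the n! permutations is equally likely.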
% Bob.RpdfGP
% function [choose_Of_Random_functionY Dset_Y_aug Dset_Y_avg_1 Dset_Y_var_1 Dset_Y_sig_1 Dset_Y_min_1 Dset_Y_max_1] = step_3_bob_rpdfgp(Task_1_Result)
function [size_of_Dset_Y_aug] = step_3_bob_rpdfgp()
clear
clc

load('Task_1_Result.mat');

%-- start: random number generation function for Bob --
Y = load('y.dat');   % load the data set y.dat
Dset_Y = Y;
Dset_Y_max = max(Dset_Y);
Dset_Y_min = min(Dset_Y);
Dset_Y_avg = mean(Dset_Y);
Dset_Y_var = var(Dset_Y);

% generate three random values for disguising mean & variance & max & min
for i = 1 : 3
    K_randY(i) = 1 + round(10.*rand(1,1));
end
Dset_Y_avg_1 = (K_randY(1)*Dset_Y_avg).*rand(1,1);
Dset_Y_var_1 = (K_randY(2)*Dset_Y_var).*rand(1,1);
Dset_Y_sig_1 = sqrt(Dset_Y_var_1);
Dset_Y_min_1 = Dset_Y_min.*rand(1,1);
Dset_Y_max_1 = Dset_Y_min + (K_randY(3)*Dset_Y_max).*rand(1,1);

% the length known from the Task 1 result is used to generate the same
% number of random noise items in Dset_Y
Length_from_Task_1_Y = length(Dset_Y);
Length_from_Task_1 = length(Task_1_Result);

% use the uniform distribution to generate random values
Generate_random_Y_using_uniform = abs(round(Dset_Y_min_1 + Dset_Y_max_1.*rand(1,Length_from_Task_1 - Length_from_Task_1_Y)));

% use the normal distribution to generate random values
Generate_random_Y_using_normal = abs(round(Dset_Y_avg_1 + Dset_Y_sig_1.*randn(1,Length_from_Task_1 - Length_from_Task_1_Y)));
%-- end: random number generation function for Bob --

% choose a random function with probability 0.5
choose_Of_Random_functionY = round(rand(1,1));
if choose_Of_Random_functionY == 1
    Dset_Y_aug = [Dset_Y Generate_random_Y_using_uniform];
else
    Dset_Y_aug = [Dset_Y Generate_random_Y_using_normal];
end

size_of_Dset_Y_aug = size(Dset_Y_aug,2);

save choose_Of_Random_functionY choose_Of_Random_functionY;
save Dset_Y Dset_Y;
save Dset_Y_aug Dset_Y_aug;
save Dset_Y_avg_1 Dset_Y_avg_1;
save Dset_Y_var_1 Dset_Y_var_1;
save Dset_Y_sig_1 Dset_Y_sig_1;
save Dset_Y_min_1 Dset_Y_min_1;
save Dset_Y_max_1 Dset_Y_max_1;
save Length_from_Task_1 Length_from_Task_1;
% Bob.STCP
% function [] = step_4_bob_stcp(Dset_Y_aug, permutation_Matrix_1, Task_1_Result)
function [rowsize_of_STCP_Result, colsize_of_STCP_Result, num_of_STCP_Result] = step_4_bob_stcp()
clear
clc

load('Dset_Y_aug.mat');
load('permutation_Matrix_1.mat');
load('Task_1_Result.mat');
load('Length_from_Task_1.mat');

% Bob uses permutation matrix 1 from Alice to permute Dset_Y_aug
Dset_Y_aug_permuted_by_permutation_Matrix_1 = Dset_Y_aug * permutation_Matrix_1;

% according to the length of X (Dset), Bob generates the same quantity
% of replicated rows
Dset_Y_aug_permuted_replicated = repmat(Dset_Y_aug_permuted_by_permutation_Matrix_1, Length_from_Task_1, 1);

% Bob computes the differences between Dset_Y_aug and the Task 1 result
DiffMatrix_Task_1_Result_Dset_Y_aug_permuted_replicated = Task_1_Result - Dset_Y_aug_permuted_replicated;

% use the data transformation technique to generate P,Q,R
positive_Result = [1;0;0];
zero_Result = [0;1;0];
negative_Result = [0;0;1];

number_Of_Rows = size(DiffMatrix_Task_1_Result_Dset_Y_aug_permuted_replicated,1);
number_Of_Columns = size(DiffMatrix_Task_1_Result_Dset_Y_aug_permuted_replicated,2);

% each comparison result becomes a column vector in relation to P,Q,R
Comparison_result_Matix_for_each_Row = zeros(3, number_Of_Columns, number_Of_Rows);
for i = 1 : number_Of_Rows
    result_posi = find(DiffMatrix_Task_1_Result_Dset_Y_aug_permuted_replicated(i,:) > 0);
    result_zero = find(DiffMatrix_Task_1_Result_Dset_Y_aug_permuted_replicated(i,:) == 0);
    result_nega = find(DiffMatrix_Task_1_Result_Dset_Y_aug_permuted_replicated(i,:) < 0);
    for j = 1 : number_Of_Columns
        if DiffMatrix_Task_1_Result_Dset_Y_aug_permuted_replicated(i,j) > 0
            Comparison_result_Matix_for_each_Row(1,j,i) = 1;
        elseif DiffMatrix_Task_1_Result_Dset_Y_aug_permuted_replicated(i,j) < 0
            Comparison_result_Matix_for_each_Row(3,j,i) = 1;
        else
            Comparison_result_Matix_for_each_Row(2,j,i) = 1;
        end
    end
end

for i = 4 : 4
    K_randY(i) = 1 + round(10.*rand(1,1));   % K_randY(4), times 10
end

% choose the security parameter
Length_of_added_column = round(number_Of_Rows + K_randY(4)*number_Of_Rows.*rand(1,1));

% generate random noise matrices; for the case n <= 400 these random
% noises are either 0 or 1, to fill the lower part of the DiffMatrix
Added_Random_Matrix = zeros(Length_of_added_column, number_Of_Rows, number_Of_Columns);
for i = 1 : Length_of_added_column
    for j = 1 : number_Of_Rows
        for k = 1 : number_Of_Columns
            Added_Random_Matrix(i,j,k) = round(1 + rand(1,1)) - 1;
        end
    end
end

for i = 1 : number_Of_Columns
    Randomized_DiffMatrix{i} = [Comparison_result_Matix_for_each_Row(:,:,i); Added_Random_Matrix(:,:,i)];
end

% generate permutation matrix 3 to permute the Randomized_DiffMatrix's
permutation_Matrix_3 = eye(Length_of_added_column + 3, Length_of_added_column + 3);
for i = 1 : Length_of_added_column + 3
    index = round(1 + (Length_of_added_column + 3 - 1).*rand(2,1));
    temp = permutation_Matrix_3(:,index(1));
    permutation_Matrix_3(:,index(1)) = permutation_Matrix_3(:,index(2));
    permutation_Matrix_3(:,index(2)) = temp;
end

% permute every Randomized_DiffMatrix by permutation matrix 3
STCP_Result = zeros(Length_of_added_column + 3, number_Of_Columns, number_Of_Rows);
for i = 1 : number_Of_Rows
    STCP_Result(:,:,i) = permutation_Matrix_3 * Randomized_DiffMatrix{i};
end

% record the swap order changed by permutation matrix 3
for i = 1 : length(permutation_Matrix_3)
    row_swap_order_by_pM3(i) = find(permutation_Matrix_3(i,:));      % the row to which xi is swapped by T3
    column_swap_order_by_pM3(i) = find(permutation_Matrix_3(:,i));   % the column to which xi is swapped by T3
end

rowsize_of_STCP_Result = size(STCP_Result,1);
colsize_of_STCP_Result = size(STCP_Result,2);
num_of_STCP_Result = number_Of_Rows;

save STCP_Result STCP_Result;
save row_swap_order_by_pM3 row_swap_order_by_pM3;
save column_swap_order_by_pM3 column_swap_order_by_pM3;
% Alice.DEP
function [size_of_Sum_of_Summation_Difference_2] = step_5_alice_dep()
clear
clc

load('STCP_Result.mat');
load('row_swap_order_by_pM2');
load('Dset.mat');
load('column_swap_order_by_pM1.mat');

% pick up the column vectors where xi' is located and combine these
% vectors into a matrix; this matrix is used to compute the possible
% sign test results from all possible P,Q,R combinations.

% find the columns of all xi' from the STCP result
Summation_Difference = zeros(length(STCP_Result(:,1,1)),length(STCP_Result(1,:,1)));
for i = 1 : length(STCP_Result(1,:,1))
    Summation_Difference(:,row_swap_order_by_pM2(i)) = STCP_Result(:,row_swap_order_by_pM2(i),i);
end

% find the columns of all xi from Summation_Difference
Summation_Difference_2 = zeros(length(Summation_Difference(:,1)),length(Dset));
for i = 1 : length(Dset)
    Summation_Difference_2(:,i) = Summation_Difference(:,column_swap_order_by_pM1(i));
end

% calculate the possible values of P, Q, R
Sum_of_Summation_Difference_2 = zeros(1,length(Summation_Difference_2));
for i = 1 : length(Summation_Difference_2)
    Sum_of_Summation_Difference_2(i) = sum(Summation_Difference_2(i,:));
end

size_of_Sum_of_Summation_Difference_2 = size(Sum_of_Summation_Difference_2,2);

save Sum_of_Summation_Difference_2 Sum_of_Summation_Difference_2;
% Bob.PRP
function [p,h] = step_6_bob_prp()
clear
clc

load('Sum_of_Summation_Difference_2.mat');
load('column_swap_order_by_pM3.mat');
load('Dset_Y.mat');

index_of_real_PQR = zeros(1,3);
for i = 1 : 3
    index_of_real_PQR(i) = column_swap_order_by_pM3(i);
end

P = Sum_of_Summation_Difference_2(index_of_real_PQR(1));
Q = Sum_of_Summation_Difference_2(index_of_real_PQR(2));
R = Sum_of_Summation_Difference_2(index_of_real_PQR(3));

% transform into MATLAB's sign test form
Test_X = zeros(1,length(Dset_Y));
for i = 1 : P
    Test_X(i) = 1;
end
for i = P+1 : P+Q+R
    Test_X(i) = 0;
end
Test_Y = zeros(1,length(Dset_Y));
for i = 1 : P+Q
    Test_Y(i) = 0;
end
for i = P+Q+1 : P+Q+R
    Test_Y(i) = 1;
end

[p,h] = signtest(Test_X,Test_Y);

save p p;
save h h;
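Bob.PRP reconstructs P (positive differences), Q (ties) and R (negative differences) and then rebuilds two artificial vectors solely so that MATLAB's signtest can be invoked. The same two-sided p-value can be computed directly from the counts, since the exact sign test is a binomial test on the non-tied pairs with success probability 1/2. A hedged Python sketch (not the thesis code; ties are discarded, which matches the usual sign-test convention):

```python
from math import comb

def sign_test_p_value(P, R):
    """Two-sided exact sign test from the counts of positive (P) and
    negative (R) differences; tied pairs (Q) are discarded."""
    n = P + R
    k = min(P, R)
    # Pr[X <= k] for X ~ Binomial(n, 1/2), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

print(sign_test_p_value(8, 2))   # 0.109375
```

For P = 8, R = 2 this gives 2 * (C(10,0) + C(10,1) + C(10,2)) / 2^10 = 112/1024 = 0.109375, the value the two artificial vectors would also yield.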
B.2 MATLAB.P22NSTC
% Matlab.Main.P22NSTC
a = load('x.dat');   % load Alice's data set
b = load('y.dat');   % load Bob's data set
nElement = size(a,2);
javaclasspath({'c:\'});
seckey = generate.Paillier.PrivateKey(128);
pubkey = seckey.generatePublicKey();

% The encryption of X by Alice
tic;
for i = 1 : nElement
    eni_x(i) = generate.Paillier.Encryption(java.prog(a(i)),pubkey);
end

% The encryption of Y by Bob
for i = 1 : nElement
    eni_y(i) = generate.Paillier.Encryption(java.prog(b(i)),pubkey);
end
TIME.encrypt = toc;
% End of encryption
% STTP.DSP
% DSP started by the STTP
tic;
nElementLow = round(nElement/3);
nElementUpper = round(2*nElement/3);
% number of n1
nElementNewA = round(nElementLow + (nElementUpper - nElementLow).*rand(1));
% number of n2
nElementNewB = nElement - nElementNewA;
PMVector1 = randperm(nElement);
for i = 1 : nElementNewA
    NewA(i) = a(PMVector1(i));
end
for i = 1 : nElementNewB
    NewB(i) = a(PMVector1(i + nElementNewA));
end
TIME.dsp = toc;
% DSP end

tic
% STTP.DRP for Alice
% STTP does DRP for Alice
% number of n1'
nElementNewAA = round(round(nElementNewA*nElementNewA/2) + (nElementNewA*nElementNewA - nElementNewA - round(nElementNewA*nElementNewA/2)).*rand(1));
% generate indices for ti and ui
for i = 1 : nElementNewAA
    indexT1(i) = round(1 + (nElementNewA-1).*rand(1));
    indexU1(i) = round(1 + (nElementNewA-1).*rand(1));
end
% generate new noise items
for i = 1 : nElementNewAA
    NewAA(i) = NewA(indexT1(i)) + NewA(indexU1(i));
end
NewAAA = [NewA NewAA];
nElementNewAAA = size(NewAAA,2);
% permute NewAAA
PMVectorAAA = randperm(nElementNewAAA);
for i = 1 : nElementNewAAA
    permNewAAA(i) = NewAAA(PMVectorAAA(i));
end
% End of DRP for Alice
% STTP.DRP for Bob
% STTP does DRP for Bob
% number of n2'
nElementNewBB = round(round(nElementNewB*nElementNewB/2) + (nElementNewB*nElementNewB - nElementNewB - round(nElementNewB*nElementNewB/2)).*rand(1));
% generate indices for ti and ui
for i = 1 : nElementNewBB
    indexT2(i) = round(1 + (nElementNewB-1).*rand(1));
    indexU2(i) = round(1 + (nElementNewB-1).*rand(1));
end
% generate new noise items
for i = 1 : nElementNewBB
    NewBB(i) = NewB(indexT2(i)) + NewB(indexU2(i));
end
NewBBB = [NewB NewBB];
nElementNewBBB = size(NewBBB,2);
% permute NewBBB
PMVectorBBB = randperm(nElementNewBBB);
for i = 1 : nElementNewBBB
    permNewBBB(i) = NewBBB(PMVectorBBB(i));
end
% End of DRP for Bob
TIME.drp = toc;

% individual decryption time by Alice and Bob; encryption time = decryption time
TIME.decryptAAA = (TIME.encrypt/nElement)*(nElementNewA + nElementNewAA)*2;
TIME.decryptBBB = (TIME.encrypt/nElement)*(nElementNewB + nElementNewBB)*2;
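The DRP steps above pad each party's share with noise items of the form NewA(t_i) + NewA(u_i), i.e. sums of two randomly chosen genuine elements, and then permute the padded vector. A compact Python sketch of that pad-and-permute step (illustrative only; the noise count is drawn roughly between n²/2 and n² − n, as in the MATLAB listing):

```python
import random

def drp(share):
    """Pad a share with pairwise-sum noise items, then randomly permute,
    mirroring the STTP.DRP step of the prototype."""
    n = len(share)
    lo, hi = (n * n) // 2, n * n - n
    n_noise = random.randint(lo, max(lo, hi))
    # Each noise item is the sum of two randomly chosen genuine elements.
    noise = [share[random.randrange(n)] + share[random.randrange(n)]
             for _ in range(n_noise)]
    padded = share + noise
    random.shuffle(padded)
    return padded

out = drp([4, 7, 9])   # 3 genuine items plus 4 to 6 noise items, shuffled
```

Because the noise items are built from the share itself, the padded vector preserves the share's value range while hiding which entries are genuine.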
% Additively homomorphic operation
tic;
for i = 1 : nElement
    eni_xy(i) = eni_x(i).multi(eni_y(i));
end
TIME.cipheraddition = toc;
% End of the additively homomorphic operation
tic
% Data transformation
% Data transformation for NewAAA by Alice
for i = 1 : nElementNewAAA
    if NewAAA(i) > 0
        TRMNewAAA(1,i) = 1;
    elseif NewAAA(i) < 0
        TRMNewAAA(3,i) = 1;
    else
        TRMNewAAA(2,i) = 1;
    end
end

% Data transformation for NewBBB by Bob
for i = 1 : nElementNewBBB
    if NewBBB(i) > 0
        TRMNewBBB(1,i) = 1;
    elseif NewBBB(i) < 0
        TRMNewBBB(3,i) = 1;
    else
        TRMNewBBB(2,i) = 1;
    end
end
TIME.transformation = toc;
tic;
% restore the NewAAA order
PMVectorAAA2 = randperm(nElementNewAAA);
for i = 1 : nElementNewAAA
    permNewAAA2(i) = NewAAA(PMVectorAAA2(i));
end

% restore the NewBBB order
PMVectorBBB2 = randperm(nElementNewBBB);
for i = 1 : nElementNewBBB
    permNewBBB2(i) = NewBBB(PMVectorBBB2(i));
end
TIME.restore = toc;
tic
% calculate P, Q, R & the sign test
P = sum(a);
Q = sum(a);
R = sum(a);
TIME.pqr = toc;

tic;
p = signtest(a,b);
TIME.signtest = toc;

TIME.total = TIME.encrypt + TIME.dsp + TIME.drp + TIME.decryptAAA + TIME.decryptBBB + TIME.transformation + TIME.restore + TIME.pqr + TIME.signtest;