INTRODUCTION
With the development of technology, ultra large databases appear. At
the same time, people show much concerns about the informational privacy.
The problem of privacy-preserving is first proposed by Agrawal and Srikant
(2000). Data mining is an efficient tool to discover valuable knowledge
from a great deal data. But the general data mining is based on the assumption
that complete access to data is available, either in centralized or federated
form. In fact, privacy and security concerns often prevent sharing of
data which may not be possible due to either legal or commercial reasons.
In legal terms, medical data cannot be released for any purpose without
appropriate anonymization. In commercial terms, data is often a valuable
business asset. So, the research direction in data mining incorporating
privacy concerns becomes fruitful. People also show much interest in the
problem of privacy-preserving data mining. The environment of data mining
can be classified into two cases: centralized and distributed data. In
the centralized environment, the focus is on the query restriction. In
the distributed environment, the data is distributed in different sites.
Now, in the distributed environment, the data is classified into horizontally
partitioned and vertically partitioned case. For horizontally partitioned
data (Yu et al., 2006a; Kantarcioglu and Clifton, 2004), the data
matrix A`s rows including all input features are divided into groups belonging
to different entities. While for vertically partitioned data, the data
matrix A`s columns are divided into groups belonging to different entities
(Yu et al., 2006b; Vaidya and Clifton, 2002). The reason is that
feature values for each individual are stored as rows of a data matrix,
while a specific feature values for all individuals are represented by
columns of a data matrix.
Data mining has many applications in the real world. One of the most
important and widely found problems is that of classification. The goal
of classification is to build a model that can predict the value of one
variable, based on the other variables. Support Vector Machine is one
of the most important classification methods in machine learning. While
SVM is one of the most actively developed classification methodology,
so there has been wide interest in privacy-preserving support vector machine
(PPSVM) classifier. Recently, people show much interest in the area of
privacy-preserving support vector machines, PPSVM for short. We briefly
review some of the relevant work. PPSVM classifiers were gained on vertically
partitioned data by adding random perturbations to the data by Yu et
al. (2006b). PPSVM classifiers using nonlinear kernels on horizontally
partitioned data were obtained by Yu et al. (2006a). But it can
only deal with binary feature data. Other privacy preserving classifying
techniques include wavelet-based distortion and rotation perturbation
(Chen and Liu, 2005).
The kernel matrix computed in the above study is equal or close to that
computed with the original data. They refer to compute inner product securely
(Ioannidis et al., 2002). In this study, we propose another approach
to the classification and testing. The kernel matrix computed in our method
is not equal to that computed with original data. But the accuracy is
comparable with the original data. We propose a unified high efficient
novel model of PPSVM classifiers on horizontally and vertically partitioned
data that is different from the existing PPSVM classifiers for such partitioned
data. During the classification and testing, we use the perturbed data.
The study is based on the two ideas. The first idea is that for a given
data matrix A, we add the same random real number to the same column of
the matrix A. Though the rows of the matrix A are modified, the relative
location of each row of matrix A is not changed. We prove that this model
has comparable accuracy with the original data. Besides, this experiment
gives the evidence. The second idea is that each entity generates a random
row vector that has the same dimension with that of its input feature
space. The entity holds it privately. In the training and testing process,
it does not change. The data for training and testing is perturbed with
the same data. By employing the two ideas, we shall describe algorithms
that protect the privacy of each partitioned data either horizontally
partitioned or vertically partitioned. The generated PPSVM classifier
has comparable accuracy comparing to that of an ordinary SVM classifier.
We first describe the notation. An mxn matrix A represents m data points
in an n-dimensional input space. An mxn diagonal matrix D contains the
corresponding labels (i.e., +1 or -1) of the data points in A. (A class
label Dii, or di for short, corresponds to the i-th
data points xi in A). All vectors are column vectors unless
transposed to a row vector by a prime superscriptT. The scalar
(inner) product of two vectors x and y in the n-dimensional space Rn
is denoted by xTy.
We assume that we can public the class labels D corresponding to data
matrix A. Each entity does not collude and does follow the proposed protocol
correctly.
SVM OVERVIEW
The Support Vector Machine (SVM) is a classifier, originally proposed
by Vapnik (2000), that finds a maximal margin separating hyperplane between
two classes of data. The aim of Support Vector Classification (SVC) is
to devise a computationally efficient way of learning good separating
hyperplane in a high dimensional feature space. SVM finds the separating
hyperplane ((ω.x)+b = 0) that maximizes the margin, denoting the
distance between the hyperplane and the closest data points (i.e., support
vectors). When we can not find a hyperplane to separate the data perfectly,
we introduce the soft margin. To maximize the margin while minimizing
the error, the standard SVM solution is formulated into the following
primal program:
ξi is the slack variable in the constraint. This shows
that SVM allows error or the soft margin. The slack or error is minimized
in the objective function. C is the margin parameter which is used to
tune the margin size and the error. The weight vector ω and the bias
b will be computed by this optimization problem. Then we can determine
the class of a new data object x by f(x) = (ω.x)+b, where the class
is positive if f(x)>0, or else negative.
To solve the primal problem, we can solve its dual problem by applying
the Lagrange multipliers:
The coefficients α are to be computed from the dual problem. Where
is the kernel function where
for linear kernel. We can also apply a nonlinear kernel for K(xi,
xj) (e.g.,
for RBF kernel, K(xi · xj) = ((xi
· xj))+1p for polynomial kernel.) The weight
vector
and
Then the classification function
can be computed.
For more information, see the tutorial written by Burges (1998).
A UIFIED MODEL FOR PPSVM
If we can access the data points xi, i = 1,2,.., l freely, we
can get the original SVM model:
Solving the problem, we can use the method earlier. While privacy and
security prevent access the data points ξi, i = 1,2,..,
l freely. One technique for privacy and security is perturbation (Agrawal
and Ramakrishnan, 2000), that is perturbs the data points with some data.
In this research, we give the same bias to the same column, that is each
data point has the same bias. In the privacy-preserving support vector
machines, privacy and security prevent to access the original data.
In present model, the original data points xi εRn,
i = 1,2,.., l become xi+a, where aεRn.
We can access the data points xi+a, i = 1,2,.., l. The
purpose of introducing vector a is to realize the requirement of privacy
and security. The PPSVM model is as following.
We give the proof that using the PPSVM model has comparable accuracy
comparing with that of the original SVM model.
Lemma 1: In the original SVM model (3) and the PPSVM model (4),
the distance of the data points xi, xj is the same
with that of the corresponding perturbed data points xi+a,xj+a.
Lemma 2: The original SVM model Eq. 3 is equivalent
to:
Similarly, the PPSVM model Eq. 4 is equivalent to:
Lemma 3: Given the same parameter C, solving the original SVM
model (5), we get the decision function f(x) = (ω·x)+b and
solving the PPSVM model we get the decision function
then we can get
for linear kernel. For the original SVM model (5), the Lagrange multipliers
are αi, i = 1,2,..., l. For the PPSVM model (6),
the Lagrange multipliers are
If the slack variable ξi is same in the original model
(5) and PPSVM model (6) (In Lemma 4, we proof that the assumption is reasonable),
then
for both linear and nonlinear kernels.
Proof: For the original SVM model (5), the Lagrange function is:
where αi are Lagrange multipliers. Its KKT conditions
are:
We replace original data points xi, i = 1,2,...,l with
xi + αi, i = 1,2,...,l and replace ω,
xi,αi,b with
in the Eq. 8-13. We get the KKT conditions of the PPSVM
model (6).
From Eq. 8-9, we get
For linear kernel:
So, ω does not change for linear kernel.
If the slack variable ξi is same in the original model
(5) and PPSVM model (6), the parameter C is the same in (5) and (6), from
the Eq. 10, we can know that the Lagrange multipliers
do not change from the Eq. 10. That is:
From Eq. 14, we know that if xi is support
vector for original SVM model, then xi+a is support vector
for PPSVM model.
Lemma 4: If
(ω,b,ξ) is the optimal solution of the original SVM model Eq.
5, then there is
such that
is the feasible solution of the PPSVM model Eq. 6.
Proof: Because
so there is
subjected
to
So:
(ω,b,ξ) is the optimal solution of the original SVM model Eq.
5, so
Then
So
is the feasible solution of the PPSVM model Eq. 6
Theorem 1: For linear kernel, given the same parameter C, the
decision function f (x) = (ω·x)+b got from the original SVM
model has the same function value with the decision function 
Proof: From Lemma 3, we know
Then for any i, αi>0:
So we get Theorem 1.
Suppose from the original SVM model (5), we get the support vectors are
xsi, i = 1,2,...,k. We give the notations. xjmax,
i = max {xsij, i = 1,2,...,k}, xjmin, i = max {xsij,
i = 1,2,...,k}, j =1,2,..,n, where xsij is the j-th element
of vector xsi. We define xmax = (x1max,
x2max,..., xnmax), xmin = (x1min,
x2min,..., xnmin).
Theorem 2: Given the same parameter C in the PPSVM model and the
original SVM model. We get the classification function f(x) = sgn((ω·x)+b)
solving the original SVM model. We get the classification function
The original data distribution function is P(x), its mean value is μ
and standard dubitation is σ.
Under the assumption that the slack variable ξi is same
in the original model (5) and PPSVM model (6), for 0-1 loss function c(x,d,f(x)),
the difference of the expect risk of classification function f(x) and
is not more than P(xmax)-P(xmin). Then we say that
the PPSVM model has comparable accuracy comparing with the original SVM
model.
Proof: From Lemma 3, we know that under the assumption the corresponding
support vectors do not change, that is if xi is a support vector
then xi+a is a support vector, too. From lemma 3, we know that
the difference of classification of f(x) and
only may be the data points between the support vectors. Then difference
of expect risk of decision functions
is:
We get that the PPSVM model has comparable accuracy comparing with the
original SVM model.
PRIVACY-PRESERVING METHODS AND QUANTIFYING PRIVACY
Present basic approach to preserving privacy is to let users provide
a modified value of its original data. We consider the technique for modifying
values:
| Table 1: |
Quantifying privacy |
|
Value perturbation: Return a value xi+r instead of
xiεR where r is a random value drawn from certain distribution.
We consider mainly two random distributions. In fact, we can draw r from
more random distributions which can improve the privacy:
| • |
Uniform: The random variable has a uniform distribution,
between [-α, α]. The mean of the random variable is 0. |
| • |
Gaussian: The random variable has a normal distribution,
with mean μ = 0 and standard deviation σ. |
Table 1 shows the privacy preserved by the methods
of Uniform and Gaussian. We fix the perturbation of an entity. So, it
has the same bias for the same attribute of all data points.
We use the same quantifying methods with the study by Agrawal and Srikant
(2000). For quantifying privacy provided by a method, we use a measure
based on how closely the original values of a modified attribute can be
estimated. If it can be estimated with c% confidence that a value x lies
in the interval [x1, x2], then the interval width
x1, x2 defines the amount of privacy at c% confidence
level.
Table 1 shows the privacy offered by the different
methods using this metric.
ALGORITHMS OF PRIVACY-PRESERVING SUPPORT VECTOR MACHINES
Before give the algorithms, we define a notation firstly.
Definition 1
:
If data
and
vector
then A
a
= ((x1+a))T, (x2+a)T,...,
(xn+a)T.
Lemma 5: For two vectors a,bεRn and AεRmxn,
then A
a
b
= A
b
a.
Algorithm of PPSVM on horizontally partitioned data: The dataset
using to obtain a classifier consists of m points in Rn represented
by the m rows of the matrix AεRmxn. Each row contains
values for n features associated with a specific individual, while each
column contains m values of a specific feature associated with m different
individuals. The data matrix A is divided into q blocks of m1,
m2, ..., mq rows with m1+m2+...+mq
= m,
The blocks of matrix A are held by q entities P1, P2,
..., Pq, respectively. That is the entity Pi owns
the data Ai. Each entity is unwilling to make public or share
its data with others. Furthermore, they do not reveal its data for various
reasons such as commercial reason. Data is often a valuable business asset
for a factory. Each row of AεRmxn is labeled belonging
to the classes by a corresponding diagonal matrix DεRmxn.
Class label Dii, or di for short, corresponds to
the i-th data points
in A. We assume that entities make public the matrix D.
In the horizontally partitioned case, we need an entity P0
that is independent from the participating entities P1,P2....,
Pq. It actually owns no data and its task is only computing
the global classification function. It is the protocol initiator. Each
entity Pi, except entity P0, generates a random
vector ri, i = 1,2,..., q having the same dimension with the
row vector dimension of data matrix Ai. The random vector ri
is privately held by the entity and is never made public and keeps the
same during the procedure of training and testing. The elements of these
random vectors are drawn from the Uniform distribution or Gaussian distribution
or other distributions. Each row of the data owned by all entities is
added to the sum of random vectors ri, i = 1,2,..., q. Then
the data matrix A becomes A
r1
r2
...
rq.
After that, we can use the data set and the PPSVM model to compute the
classification function f(x) = sgn((ω·x)+b). When one entity
has a new data point x whose class label the entity wants to know. The
initiator P0 can predict its label using the data x+r1+r2+...+rq.
Now we completely describe algorithms for horizontally partitioned data
using PPSVM model. The whole algorithm is described as Algorithm 1 and
Algorithm 2.
ALGORITHM 1
| • |
Step 0 (initialize): P0 is the protocol
initiator. The entity Pi (i = 1,2,..,q) generates its random
vector ri and holds it privately. The vector has the same
dimension with the row of its dataset owned and keeps the same during
the procedure of training and test. |
| • |
Step 1: The first step begins by the initiator P0
sending an empty dataset to its neighbor P1. Note that,
to prevent P0 from contributing to training set, P1
must reject P0`s start request if P0 sends a
non-empty set in the first round. |
| • |
Step 2: For i = 1,2,...q-1 entity Pi sends the
dataset
to entity Pi+1. |
| • |
Step 3: Pq send the dataset |
to P0.
| • |
Step 4: P0 removes and
send the remaining dataset
to entity P1. |
| • |
Step 5: For i = 1,2,...,q-2, Pi sends the dataset
to P0 and sends |
to Pi+1.
| • |
Step 6: Pq-1 sends
to P0. |
| • |
Step 7: P0 gets dataset |
(Lemma 5) . Go to Algorithm 2.
ALGORITHM 2
| • |
Step 0 : P0 now owns the dataset |
All entities make public the class matrix Dll = ±1,
l = 1,2,..,q for the data matrices Ai, i = 1,2,..q.
| • |
Step 1: Solve the PPSVM`s dual problem |
Get the optimal α*.
Get the classification function is f(x) = sgn ((ω·x)+b).
| • |
Step 3: For each new xεRn obtained
by an entity, that entity wants to know its label. Execute the Algorithm
1. At last, P0 gets the data x+r1+..+rq.
Using the classification function in step 2, we can get the label
of x+r1+..+rq which is also the label of x. |
Assuming that the participating entities do not collude with the initiator
and do follow the given protocol correctly, the initiator successfully
acquires the global SVM model without disclosing any information on the
private data of any entity.
One may ask why not one entity adds its random row vector to each row
of its private dataset and sends to the initiator entity P0
directly. The reason is that random row vectors generated by entities
may be different. Then the same column of the data horizontally partitioned
has different bias values comparing to the original data. Thus it effects
the classification accuracy. Another problem is that the initiator entity
P0 must send the empty dataset to its neighbor P1
in the step 1 of the Algorithm 1. The key is that if P0 sends
data, in the end, it will know the bias value of its data. Because each
row of the dataset has the same bias value (r1+..+rq),
the initiator will know the data privately owned by the entities. It violates
the privacy of the datasets.
From the algorithms described above, we know that each row of all datasets
has the same bias value. The initiator P0 constructs the PPSVM
model using the perturbed dataset. Then we discuss how to perform testing/classification
using the PPSVM model constructed by our algorithms. Suppose the entity
Pi has the dataset Newi to know its classification.
The entity Pi could simply send the original dataset Newi
to initiator P0 to classify. But this would violate this constraint
that the initiator should learn nothing about the private data of any
entity. Furthermore, the training datasets have the bias value. Using
the original dataset to classify has a low accuracy. To be corresponding
with the training dataset, the testing/classifying datasets should carry
out the algorithms described above. Then the initiator would have the
perturbed datasets for testing/classifying using the constructed PPSVM
model.
As discussed earlier, the initiator is prohibited from contributing data
to the testing set. That is to prevent it from obtaining any information.
Thus P1 must reject P0`s start request if P0
sends a nonempty set in the Step 1 of Algorithm 1. No data is disclosed
to any entity in this process. The details of security can be found in
the paper by Agrawal and Srikant (2000).
Algorithm of PPSVM on vertically partitioned data: Suppose there
are P1,P2,...,Pq participating entities,
having the datasets A1,A2,...,Aq, correspondingly.
All of these entities hold some feature values of the same group. The
group is constituted with m individuals. The entity Pi has
fi features, i = 1,2,..,q, where f1+f2+...+fq
= n. The whole data is A = A[A1 A2 ...Aq]εRmxn.
Suppose there is a volunteer entity P1 that is willing to compute
the global PPSVM model. We call it initiator and other entities are called
passive entities. Similar to the horizontally partitioned case, our goal
is to obtain the perturbed dataset that has the same bias value for each
column. Entity P1(i = 2,3,...,q) generates a random vector
ri whose dimension is the same with the row dimension of it
owned dataset. Each element of the vector ri is drawn from
Uniform distribution, Gaussian distribution or other distributions. Entity
Pi holds the vector ri privately. We now describe
the algorithm (Algorithm 3) for PPSVM model on vertically partitioned
data.
ALGORITHM 3
| • |
Step 0 (Initialize): Choose the initiator Pi.
Entity Pi (i = 2,3,...,q) generates its random vector ri.
The random vector does not change during the training and testing.
All q entities agree on the same labels (matrix D) for each data point.
If an agreement on D is not possible, we can use semi-supervised learning
(Fung and Mangasarian, 2001b) to handle such data points. It is the
future work. |
| • |
Step 1: For i = 2,3,...,q, entity Pi (i = 2,3,...,q)
executes the operation :Ai ri.
Then the entity Pi sends Ai ri
to the initiator P1. Thus the initiator Pi now
owns the data Ai ri,
i = 2,3,..q. The initiator P1 owns the data
where 0 is the vector whose elements are zero. Its data point is represented
with
|
| • |
Step 2: Using all data, initiator P1 solve the
PPSVM dual problem: |
Get the optimal α*.
The classification function is f(x) = sgn ((ω·x)+b).
| • |
Step 4: For each new x obtained by an entity
Pi, i = 1,2,...,q, that entity want to know its label.
Entity Pi sends the data x+ri to initiator P1,
if i = 1, then ri = 0. |
Using the classification generated in step 3, we can get the label of
x+ri that is also the label of x.
In the Algorithm 3, the initiator P1 does not send its privately
owned data to any entity, so it does not disclose any information of its
owned data. For the entity Pi (i = 2,3,,..,q), it sends dataset
Ai
ri
to initiator P1. Because each element of the random vector
ri is generated from certain distribution, which is known to
Pi, it is hard for the initiator P1 to get any information
of Ai. When entity Pi has data Newi to
know its class label, Pi sends Newi
ri
to Pi as described in Algorithm 3. There is not any information
disclosed in the protocol. Our algorithm is secure. The details of security
can be found in the stduy by Agrawal and Srikant (2000).
EXPERIMENTAL RESULTS
The present research demonstrate the efficiency of our protocol in three
ways. First, our experiments show that the accuracy obtained using this
PPSVM is the same as that of SVM when the data is centralized. Second,
our experiments show that using our approach, entities can obtain classifiers
with lower misclassification error than that obtained using only one entity`s
data alone.
We choose three datasets from UCI repository. To simulate a situation
in which each entity has only a subset of the feature space for each data
point, we randomly distribute the features among the entities such that
each entity receives about the same number features. Similarly, to simulate
the situation in which each entity has only a subset of the individuals
with the same features, we randomly distribute the data points among the
entities such that each entity has about the same number individuals.
The SVM model is created using the LIBSVM. During the experiments, we
choose the same SVM model for one dataset.
Comparison of this approach with classifiers obtained when the data
is original centralized: We demonstrate the classification accuracy
of our approach is the same with that of the case when the data is centralized.
The accuracy is shown in the Table 2 no matter horizontally
partitioned or vertically partitioned.
From the Table 2, we can observe that the accuracy
of our approach obtained the same with that obtained when the data is
original centralized. Besides, the accuracy does not change with the increasing
of entities number.
| Table 2: |
Accuracy comparison of our approach no matter horizontally
or vertically partitioned data case with the original centralized
data case |
|
| Table 3: |
Comparison the classification accuracy of our approach
with that of using only one entity`s original data in the horizontally
partitioned case |
|
| Table 4: |
Comparison the classification accuracy of our approach
with that of using only one entity`s original data in the vertically
partitioned case |
|
Comparison of our approach with classifiers obtained when the data
is part of original centralized: We compare the classification accuracy
of this approach with that of using only one entity`s original data. The
accuracy is shown in the Table 3 in the case of horizontally
partitioned. Table 4 shown the accuracy of vertically
partitioned case.
From Table 3 and 4, we can see that
this approach has lower classification error comparing with that using
only one entity`s original data in the most cases, especially when the
number of the data points is not large.
DISCUSSION
Though our approach is efficient and secure, there are some challenges.
The requirement of an entrusted intermediator seems overly restrictive.
A key challenge for the future work is to remove the need of the intermediator,
while still efficiently constructing the SVM model. In the vertically
partitioned case, we can deal with data points using semi-supervised learning
methods when an agreement on class label D is not possible. We also can
immerge our approach with other SVM model such as RSVM (Lee and Huang,
2007), PSVM (Fung and Mangasarian, 2001a) etc. to improve the accuracy
and reduce running time. There is room for future work in PPSVM. Privacy-preserving
Support Vector Regression may be a prospective study direction.
CONCLUSION
This study proposes a unified model for privacy-preserving classification
on horizontally and vertically partitioned data. We use the distinct character
of classification. Giving the same bias value to the same column of the
training and testing datasets, it does not affect the accuracy of classification.
During the data passing, no data information is disclosed. Our protocol
is secure. Besides, our experiment shows that our approach is scalable
as the increasing of entities number. It costs lower time. PPSVM proposed
can deal with the real number data points comparing with the paper by
Yu et al. (2006a). The initiator in our approach can deal with
the received dataset as its original dataset. It is convenient for initiator
to construct the global SVM model and test the new data.
ACKNOWLEDGMENT
This research is supported by the National Science Foundation of China
under Grant (No. 10571109, 60603090 and 90718011).