The goal of dimensionality reduction is to map high dimensional data into
a low dimensional space, compressing the data in size. PCA and MDS are classical
linear techniques. Recently, manifold learning algorithms have been developed
to perform nonlinear dimensionality reduction, such as ISOMAP (Tenenbaum
et al., 2000), LLE (Roweis and Saul, 2000),
LTSA (Zhang and Zha, 2004), Laplacian Eigenmaps (Belkin
and Niyogi, 2003), Multilayer autoencoders (Hinton and
Salakhutdinov, 2006) and so on.
ISOMAP is a generalization of MDS. With increasing sample size, it becomes
more and more difficult for MDS and ISOMAP to solve the full eigenvector problem.
To overcome this drawback, LMDS (Landmark MDS) (De Silva
and Tenenbaum, 2003) only preserves distances or geodesic distances between
some landmark points. However, random landmark point sets often lead to worse results.
In this study, we proposed a fast manifold learning algorithm which only preserved the similarities or dissimilarities between each point and its k nearest neighbors as well as some random far points. Experiments showed that the method was effective.
OVERVIEW OF MDS AND ISOMAP
Classical MDS is realized by matching the ordinal level of similarities or
dissimilarities among all objects in the two spaces. To this end, Kruskal
(1964) proposed a measure called the stress function, which shows how
well the embedded data express the original data:

S = sqrt( Σ_{i<j} (dij − Δij)² / Σ_{i<j} dij² )    (1)

where dij and Δij are, respectively, the similarity or dissimilarity in the original space and in the embedded space.
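As a concrete illustration, Kruskal's stress can be computed from the two pairwise distance matrices. The following is a minimal NumPy sketch; the helper name kruskal_stress is ours, and we assume the common normalization by the sum of squared original dissimilarities:

```python
import numpy as np

def kruskal_stress(D_orig, D_emb):
    """Stress between original dissimilarities d_ij (D_orig) and
    embedded distances Delta_ij (D_emb), both N x N matrices."""
    iu = np.triu_indices_from(D_orig, k=1)   # each pair (i, j) counted once
    d, delta = D_orig[iu], D_emb[iu]
    return np.sqrt(np.sum((d - delta) ** 2) / np.sum(d ** 2))
```

A stress of zero means the embedding reproduces the original dissimilarities exactly; larger values indicate a poorer match.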
ISOMAP replaces Euclidean distances with geodesic distances on the basis
of MDS. The algorithm contains three steps. First, the neighborhood graph
is built. Then, the geodesic distances between all pairs are estimated by computing
the shortest path distances in the neighborhood graph. Finally, the low dimensional
coordinates are found by applying MDS to the geodesic distance matrix. Details
can be found in Tenenbaum et al. (2000).
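The first two steps (neighborhood graph plus shortest paths) can be sketched with SciPy's sparse graph routines; geodesic_distances is a hypothetical helper name, and the third step would apply classical MDS to the matrix it returns:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k):
    """Build the k-nearest-neighbor graph of X and estimate geodesic
    distances as shortest path lengths on that graph."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    idx = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest neighbors, self excluded
    rows = np.repeat(np.arange(N), k)
    cols = idx.ravel()
    W = csr_matrix((D[rows, cols], (rows, cols)), shape=(N, N))
    # treat the graph as undirected and run Dijkstra from every node
    return shortest_path(W, method='D', directed=False)
```

For points along a curve, the geodesic estimates follow the curve through the chain of neighbors rather than cutting across the ambient space.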
THE FAST ALGORITHM
The basic idea: In MDS and ISOMAP, solving the full eigenvector problem leads to high complexity. In contrast, LLE, Laplacian Eigenmaps and LTSA have lower complexity because the eigenvector problem in these methods is sparse. If MDS and ISOMAP also solved a sparse eigenvector problem, their execution would speed up greatly. It is therefore necessary to construct a sparse similarity or dissimilarity matrix. How can we build such a matrix? We can preserve the similarities or dissimilarities between only some pairs of points rather than all pairs.
LMDS achieves lower complexity by preserving only the distances between some
landmark points. But random landmark sets often do not represent the true
topology of the manifold and lead to worse results. How, then, can we reduce the complexity
while guaranteeing the embedding quality? The problem can be solved
by combining the local structure and the global structure of the manifold: we
can preserve the similarities or dissimilarities between each point and a small
set of points comprising its nearest neighbors and some far points.
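The selection of these per-point sets can be sketched as follows; build_point_sets is a hypothetical name, and for simplicity the far points are drawn uniformly from the non-neighbors:

```python
import numpy as np

def build_point_sets(X, k1, k2, seed=None):
    """For each point, keep its k1 nearest neighbors (local structure)
    plus k2 random non-neighbor far points (global structure)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(D, axis=1)            # column 0 is the point itself
    omega = []
    for i in range(N):
        near = order[i, 1:k1 + 1]
        far = rng.choice(order[i, k1 + 1:], size=k2, replace=False)
        omega.append(np.concatenate([near, far]))
    return np.array(omega)                   # shape (N, k1 + k2)
```

Each row of the returned index array lists the k1 + k2 partners whose dissimilarities to point i will be preserved.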
Let X contain N samples. The goal of dimensionality reduction is to force Δij to match dij for each possible pair (i, j). To that end, MDS uses the stress function to measure the error between the pairwise distances in the two spaces and carries out the embedding by minimizing Eq. 1. Because only some similarities or dissimilarities are preserved in the new method, Eq. 1 is changed to:

Sp = sqrt( Σ_i Σ_{j∈Ωi} (dij − Δij)² / Σ_i Σ_{j∈Ωi} dij² )    (2)

where Ωi is the set containing the nearest neighbors of point i and some random far points. We call Eq. 2 the part stress function.
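Given the preserved sets, the part stress can be evaluated over those pairs alone. A minimal sketch under our own naming: d_near[i, m] holds dij for the m-th preserved partner of point i, and omega holds the corresponding indices:

```python
import numpy as np

def part_stress(d_near, Y, omega):
    """Part stress: Kruskal stress restricted to the preserved pairs."""
    # embedded distances Delta_ij for the preserved pairs only
    delta = np.linalg.norm(Y[:, None, :] - Y[omega], axis=-1)
    return np.sqrt(np.sum((d_near - delta) ** 2) / np.sum(d_near ** 2))
```

Only (k1 + k2)N distances are touched here, instead of all N(N − 1)/2 pairs.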
According to the part stress, we can introduce the part energy function:

E = (1/2) Σ_i Σ_{j∈Ωi} (dij − Δij)²    (3)

Let:

Ei = (1/2) Σ_{j∈Ωi} (dij − Δij)²    (4)

so that Equation 3 is rewritten as E = Σ_i Ei. The minimization is performed by the steepest gradient descent method. Let ∇Ei be the gradient of Ei with respect to yi; then:

Δyi = −λ∇Ei,  with ∇Ei = −Σ_{j∈Ωi} ((dij − Δij)/Δij)(yi − yj)    (5)

where λ is a learning rate parameter. Equation 5 gives the rule for updating the embedding coordinates at each step.
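The per-point update Δyi = −λ∇Ei can be sketched as follows; update_point is a hypothetical helper, and a small eps guards against division by a zero embedded distance:

```python
import numpy as np

def update_point(Y, i, omega_i, d_i, lam, eps=1e-12):
    """Compute Delta y_i = -lambda * grad E_i for point i, where
    grad E_i = -sum_j ((d_ij - Delta_ij) / Delta_ij) * (y_i - y_j)."""
    diff = Y[i] - Y[omega_i]                     # y_i - y_j, one row per j
    delta = np.linalg.norm(diff, axis=1) + eps   # Delta_ij (eps avoids 0-division)
    grad = -(((d_i - delta) / delta)[:, None] * diff).sum(axis=0)
    return -lam * grad
```

When the embedded distance is smaller than the target dissimilarity, the step pushes yi away from yj, stretching the pair toward the target; in the opposite case it pulls them together.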
The description: The method is an iterative procedure in which the low
dimensional coordinates are obtained by preserving the similarities or dissimilarities
between each point and a small set of points. During the iterative process, the part stress
function is used as the objective function. To simplify the algorithm, Euclidean
distances are used to construct the neighborhood graph and only the dissimilarity
measure is employed. The fast algorithm can be described as follows:
Input: X containing N samples, the neighborhood size k1, the number k2 of far points, the embedded dimension d, the max learning steps L, the min part stress s and the learning rate λ
Output: The embedded coordinates Y
Step 1: Determine the k1 nearest neighbors and randomly choose k2 far points for each point
Step 2: Define the dissimilarity measure
Step 3: Estimate the dissimilarities between each point and its k1 nearest neighbors and k2 far points
Step 4: Initialize Y with PCA or a random method and set l = 1
Step 5: Compute the part stress. If the part stress is smaller than s, return Y
Step 6: Update Y. Execute the following steps for each point i:
  (a) Calculate Δyi according to Eq. 5
  (b) Set yi^{l+1} = yi^l + Δyi
Step 7: Set l = l+1. If l is bigger than L, return Y; else go to step 5
If Euclidean distances are used as the dissimilarity measure, step 3 can be skipped. If geodesic distances are used as the dissimilarity measure, they can be approximated by the shortest paths on the neighborhood graph, which can be computed with Dijkstra's algorithm.
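Under the Euclidean-distance setting just described, the whole procedure can be sketched end to end. fast_manifold_embed is a hypothetical name, and random initialization is used instead of PCA for brevity:

```python
import numpy as np

def fast_manifold_embed(X, k1=7, k2=5, d=2, L=200, s=0.02, lam=0.05, seed=0):
    """Iterative embedding preserving only neighbor and far-point
    dissimilarities (Euclidean variant, so step 3 is trivial)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(D, axis=1)
    # steps 1 and 3: k1 nearest neighbors, k2 random far points, and their d_ij
    omega = np.stack([
        np.concatenate([order[i, 1:k1 + 1],
                        rng.choice(order[i, k1 + 1:], k2, replace=False)])
        for i in range(N)])
    d_near = np.take_along_axis(D, omega, axis=1)
    # step 4: initialize Y (randomly here; PCA is the alternative)
    Y = rng.normal(scale=0.01, size=(N, d))
    for _ in range(L):                            # steps 5-7
        diff = Y[:, None, :] - Y[omega]           # y_i - y_j for preserved pairs
        delta = np.linalg.norm(diff, axis=-1) + 1e-12
        stress = np.sqrt(np.sum((d_near - delta) ** 2) / np.sum(d_near ** 2))
        if stress < s:                            # step 5: stop when stress is small
            break
        coef = (d_near - delta) / delta
        grad = -(coef[:, :, None] * diff).sum(axis=1)
        Y = Y - lam * grad                        # step 6: y_i += -lam * grad E_i
    return Y
```

The loop touches only the (k1 + k2)N preserved pairs per iteration, which is the source of the complexity advantage analyzed next.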
The complexity analysis: One main computational cost of the method is the iterative process, whose time complexity is Ο(c(k1+k2)N) (c is a constant). Usually, the sum of k1 and k2 is far below N. In contrast, solving a full N×N eigenvector problem in MDS and ISOMAP has Ο(N³) complexity. At the same time, a full N×N dissimilarity matrix must be stored there, giving Ο(N²) space complexity, whereas we only store a (k1+k2)×N matrix, whose space complexity is Ο((k1+k2)N). Another main cost is estimating the dissimilarities; the time complexity of this part is the same for the new algorithm and MDS. Apparently, the complexity of the fast algorithm is greatly reduced.
EXPERIMENTS
We tested the performance of the new algorithm and discussed some factors influencing
it. The experiments included four parts: comparison of the performance
between the new method and ISOMAP, comparison of the efficiency of the two methods,
the effect of the number of far points and the effect of the initial value on the embedded
results.
Fig. 1: Two data sets including 1000 samples. (a) Swiss roll data set and (b) S-curve data set
In the experiments, we used Kruskal's (1964) stress, denoted ks, to measure
the performance. The distances in the two spaces were, respectively, the
Euclidean distance and the geodesic distance. All programs were implemented
Euclidean distance and the geodesic distance. All programs were implemented
Comparison of the performance between two methods: To compare the performance
of ISOMAP and the proposed algorithm, we tested them on two data sets: Swiss
roll data set (Tenenbaum et al., 2000) and S-curve
data set (Roweis and Saul, 2000). Each set contained
1000 samples (Fig. 1).
Firstly, we ran ISOMAP once on each data set and computed the stress. Then, the fast algorithm was executed on the two sets with k2 = 5, 10, 15 and 20. The initial value was set with PCA. For each case, the algorithm ran 50 times, and we computed the minimum, maximum, mean and standard deviation of ks over the 50 runs, denoted min, max, mean and sd respectively. The results are given in Tables 1 and 2. Here, k1 was 7 on the Swiss roll data set and 20 on the S-curve data set.
From the experiments, the stress of the new algorithm was close to or smaller than that of ISOMAP. That is, by preserving only the distances between each point and a few other points, the fast algorithm achieved performance close or superior to that of ISOMAP.
Comparison of the executing efficiency of two methods: Theoretically, the complexity of the new algorithm is lower than that of ISOMAP. We compared their executing efficiency on the Swiss roll data set: for the fast method, we measured the time of the iterative process; for ISOMAP, the time of solving the full N×N eigenvector problem. To examine the relation between executing efficiency and sample size, we ran them with sample sizes from 1000 to 8000. Here, k1 = 7 and k2 = 5, 10, 15, 20. The initial value was set with PCA.
Table 1: Results on Swiss roll data set
Table 2: Results on S-curve data set
Table 3: Efficiency comparison between two algorithms for different sample sizes
The max iterative step was 100 and the min part stress was 0.02. For each case, the mean time over 10 runs was calculated. The results are shown in Table 3.
From the results, ISOMAP could not run when the sample size exceeded 6000 because the matrix was too big. For smaller sample sets, the executing efficiency of ISOMAP was higher than that of the new algorithm, but the advantage of the fast method over ISOMAP grew as the sample size increased.
Effect of the number of far points on the embedded results: In the algorithm, one needs to randomly select some far points before running the program. How many far points should we select? In other words, how does the number of far points affect the results? To this end, we ran the method on the Swiss roll data set with 1000 samples for various numbers of far points. Here, k1 = 7 and k2 = 0, 1, 3, 5, 10, 20. For each case, we computed the statistics of ks over 50 runs. During each execution, the max iterative step was 200 and the initial value was set with PCA. The results are given in Table 4.
Table 4: Values of ks for different numbers of far points
Table 5: Results for the initial value with PCA
Table 6: Results for the random initial value
Experiments showed that the number of far points greatly influenced the embedded results. Without far points, the results were worse because the global information was ignored. The results were greatly improved by selecting even a single far point, and the more far points were chosen, the better the results. However, the results were similar once more than five far points were used.
Effect of the initial value on the embedded results: In the fast algorithm, an initial value of the low dimensional embedding must be given before the iteration starts. How, then, does the initial value affect the results? In this section, we adopted two methods to set the initial value: one was a random initial value; the other used the results of PCA. Here, k1 = 7 and k2 = 1, 3, 5, 10, 20. For each case, we computed the statistics of ks over 50 executions. The results are given in Tables 5 and 6 for the two cases.
As the experiments show, the initial value greatly affected the embedded results, especially when far points were few. For both cases, the results became closer as the number of far points increased. For the initial value with PCA, the min, max and mean values of the stress were similar once more than five far points were used, and the standard deviation was also very small. For the random initial value, however, the three values varied greatly and the standard deviation was bigger.
CONCLUSION
MDS and ISOMAP, which have high complexity, are typical manifold learning methods. Although LMDS has lower complexity, random landmark point sets often lead to worse results. In this study, a fast manifold learning method was proposed which only preserves the similarities or dissimilarities between each point and a small set of points comprising its nearest neighbors and some random far points. Experiments showed that the method could discover the manifold embedded in the high dimensional space with higher efficiency. Experiments also showed that better results are obtained when more than five far points are used, and that the initial value influences the embedded results. In future work, we will examine the performance of the algorithm in real applications.
ACKNOWLEDGMENT
The study was supported by the Research Project of Application Foundation and Advanced Technology of Tianjin of China (No. 10JCZDJC16000).