An Improved K-means Clustering Algorithm Applicable to Massive High-dimensional Matrix Datasets

Since the K-means clustering algorithm is easy to implement and highly efficient, it has been widely used in cluster analysis of massive datasets. However, the value of k is difficult to determine in advance, and the randomness of choosing initial centers leads to a series of problems, such as instability, convergence to local optima, and sensitivity to outliers. Results from hierarchical clustering are more natural than those from K-means clustering, but its high time and space complexity make it difficult to apply to large datasets. In this paper, by combining hierarchical clustering with K-means clustering, we propose an improved K-means clustering algorithm and evaluate it experimentally on datasets provided by MovieLens.


Introduction
As an unsupervised learning method, cluster analysis is an important means of data mining. Without prior knowledge of the data distribution, clustering classifies data into groups driven by the data itself [1].
Among cluster analysis methods, the K-means clustering algorithm stands out for its usability, fast convergence and ability to handle large datasets [2]. However, K-means also has some disadvantages: 1) because the data distribution is unknown, the value of K is difficult to estimate; 2) the randomness of the initial centers means the clustering results often fall into an unstable local optimum rather than the global optimum; 3) it is sensitive to outliers; 4) it cannot handle non-spherical clusters or clusters of different sizes and densities.
Results from hierarchical clustering can reflect the hierarchical structure of the dataset. But the time and space complexity of hierarchical clustering are so high that it is unsuited to high-dimensional massive datasets.
In this paper, we integrate principal component analysis and sampling into a combination of hierarchical clustering and K-means clustering, and propose a new clustering algorithm named the SPHK-means clustering algorithm.

K-means clustering algorithm
K-means is a clustering algorithm based on partitioning.

Basic algorithm
Basic steps of the K-means clustering algorithm are as follows: 1) Select initial cluster centers: randomly select k data objects from the dataset of m data objects; 2) Assign objects to clusters: calculate the similarity between every data object and every center, and assign each data object to the group whose center is most similar to it; 3) Calculate new centers: for each group, compute the new center as the mean of all member data objects in the group; 4) Test for termination: if two consecutive iterations produce identical assignments, the process stops; otherwise, go to step 2) for the next iteration.
The space complexity of the K-means clustering algorithm is O((m + k) · n) [4], and its time complexity is O(t · k · m · n) [4], where m is the number of data objects, n is the number of attributes of each data object, and t is the number of iterations.
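The four steps above can be sketched as follows. This is a minimal illustration in Python/NumPy, not the paper's implementation; it uses Euclidean distance as the (dis)similarity measure and stops when two consecutive iterations assign objects identically.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means on an (m, n) data matrix X, following steps 1-4."""
    rng = np.random.default_rng(seed)
    # 1) randomly pick k of the m data objects as initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2) assign each object to the nearest (most similar) center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        # 4) two consecutive identical assignments -> terminate
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3) recompute each center as the mean of its current members
        for j in range(k):
            members = X[labels == j]
            if len(members):          # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers
```

The empty-cluster guard is a practical detail the step list leaves implicit: a randomly seeded center can end an iteration with no members, and averaging an empty set is undefined.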

Shortcoming and solutions
There are four main disadvantages of the K-means clustering algorithm: 1) how to determine the optimal value of K; 2) the randomness of the initial centers; 3) sensitivity to outliers; 4) difficulty processing clusters that are non-spherical or of uneven density and size. Based on the observation that the two sample points farthest apart are the least likely to belong to the same cluster, D. H. Zhai et al. used the maximum-distance method [5] to select initial cluster centers. Y. Qin et al. [6] generated initial centers by detecting densely populated areas. J. P. Zhang et al. [7] optimally partitioned the data sample space using a histogram method, and determined the value of k and the initial centers for the K-means algorithm according to the characteristic distribution of the data sample.

Agglomerative hierarchical clustering
Hierarchical clustering, whose results are closer to the natural classification of the data objects, is a long-established clustering technique. It includes two basic approaches, agglomerative (bottom-up merging) and divisive (top-down splitting); the former is more commonly used.

Basic algorithm
Agglomerative hierarchical clustering begins from the initial state in which each data point is a cluster by itself. In each iteration, it merges the two clusters whose mutual distance is shortest into a new cluster. It controls the number of result clusters by setting a threshold on the inter-cluster distance: once the distances between all pairs of clusters are larger than the threshold, the merging process stops. There are three common methods of measuring the distance between clusters: 1) single linkage; 2) complete linkage; 3) group average. In this paper, we use the last. Let u and v be two clusters, and let u_i be the i-th member of u and v_j the j-th member of v; the distance between clusters under group average is:

d(u, v) = \frac{1}{|u| \, |v|} \sum_{i=1}^{|u|} \sum_{j=1}^{|v|} \mathrm{dist}(u_i, v_j)    (1)
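The merge loop and the group-average distance of formula (1) can be sketched as below. This is an illustrative quadratic-scan implementation (names `group_average_distance` and `agglomerate` are ours, not from the paper), deliberately simple rather than efficient:

```python
import numpy as np

def group_average_distance(U, V):
    """Formula (1): mean pairwise Euclidean distance between every
    member u_i of cluster U and every member v_j of cluster V."""
    d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=2)
    return d.mean()

def agglomerate(X, threshold):
    """Start with singleton clusters; repeatedly merge the closest pair
    (group-average distance) until every inter-cluster distance exceeds
    the threshold."""
    clusters = [X[i:i + 1] for i in range(len(X))]
    while len(clusters) > 1:
        # find the closest pair of clusters
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: group_average_distance(clusters[p[0]],
                                                        clusters[p[1]]))
        if group_average_distance(clusters[i], clusters[j]) > threshold:
            break                      # all remaining pairs are too far apart
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters
```

Rescanning all pairs each iteration is what gives the naive algorithm its high time complexity, discussed next.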

Shortcoming and solutions
As P. N. Tan et al. [4] mention, the space complexity of the agglomerative hierarchical clustering algorithm is O(m^2), and its time complexity is O(m^2 log m) [4], where m is the number of data points. The relatively high time and space complexity make it difficult for hierarchical clustering to process very large datasets.

Principal component analysis
PCA can reduce the dimensionality of high-dimensional data and, to some extent, remove noise.
Suppose there are p features X_1, X_2, …, X_p. PCA transforms the original features into new features Z_1, Z_2, …, Z_p, each a linear combination of the originals:

Z_i = a_{i1} X_1 + a_{i2} X_2 + \dots + a_{ip} X_p,  i = 1, 2, …, p

subject to these conditions: 1) for each principal component, the sum of squares of its coefficients is 1, i.e. a_{i1}^2 + \dots + a_{ip}^2 = 1; 2) all principal components are mutually uncorrelated; 3) the variances of the principal components are in descending order, namely, their importance is in descending order [8]: Var(Z_1) ≥ Var(Z_2) ≥ … ≥ Var(Z_p).
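A minimal PCA sketch via singular value decomposition of the centered data illustrates the three conditions: the rows of Vt are unit-norm loading vectors (condition 1), the resulting component scores are mutually uncorrelated (condition 2), and their variances come out in descending order (condition 3). This is a generic textbook construction, not the paper's code:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components principal components."""
    Xc = X - X.mean(axis=0)                       # center each feature
    # rows of Vt are the unit-norm coefficient vectors a_i
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # component scores Z
```

Keeping only the first few rows of Vt is what performs the dimensionality reduction used later in the SPHK-means pipeline.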

Model design of sphk-means clustering algorithm
Let the row coordinates represent the objects to be scored and the column coordinates represent the users, so the dataset is an N × M scoring matrix whose entry r_ij is the score the j-th user gave the i-th object. According to the above analysis, when N and M are large enough, neither K-means clustering nor agglomerative hierarchical clustering alone solves the problem well. The SPHK-means clustering algorithm proceeds in two phases: 1) pre-cluster, by integrating sampling, principal component analysis and hierarchical clustering, to determine the value of K and the initial centers; 2) run K-means clustering with the value of K and the initial centers determined in phase 1) to get the final result. Its working process is shown in Figure 1.

Sampling PCA hierarchical clustering
This pre-clustering stage determines the value of K and the initial centers.

Determine the appropriate value of K
In total, r samples are extracted. In the i-th pass (i = 1, 2, …, r), N/5 objects are extracted as a sample from the whole set of N objects. PCA then reduces the dimensionality of the sample to M/6, giving the reduced scoring matrix of the sample. Cluster the reduced sample by agglomerative hierarchical clustering using formula (1), and record the number of resulting clusters as k_i. The average k̄ and the standard deviation of the k_i delimit the appropriate range of K.
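The sample–reduce–cluster loop can be sketched as follows. This is a simplified stand-in, with assumptions worth flagging: the helper names are ours, and clustering is done by linking points closer than a distance threshold (single-linkage connected components) rather than the paper's group-average hierarchical clustering, to keep the sketch short and self-contained:

```python
import numpy as np

def count_clusters(X, threshold):
    """Cluster count when points closer than `threshold` are linked
    (union-find over pairwise distances; a simplified stand-in for
    threshold-stopped hierarchical clustering)."""
    n = len(X)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) <= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

def estimate_k_range(R, r=10, threshold=2.0, seed=0):
    """Draw r samples of N/5 objects from the N x M scoring matrix R,
    reduce each to ~M/6 dimensions with PCA, cluster it, and record the
    cluster count k_i; return the mean and std of the k_i."""
    rng = np.random.default_rng(seed)
    N, M = R.shape
    ks = []
    for _ in range(r):
        sample = R[rng.choice(N, size=max(2, N // 5), replace=False)]
        Xc = sample - sample.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        Z = Xc @ Vt[:max(1, M // 6)].T     # keep ~M/6 principal components
        ks.append(count_clusters(Z, threshold))
    ks = np.array(ks)
    return ks.mean(), ks.std()
```

The mean and standard deviation of the recorded k_i then bound the candidate values of K for the next stage.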

Determine the appropriate Initial centers
Repeat the above process of sampling, dimensionality reduction and hierarchical clustering until a run first yields k_0 clusters; at that point the resulting centers {v_1, v_2, …, v_{k_0}} are taken as the appropriate initial centers.

K-means clustering
After reducing the dimensionality of the N objects to M/6 using PCA, cluster the N objects by means of the K-means clustering algorithm with the initial centers v_1, v_2, …, v_{k_0} determined in the pre-clustering phase.
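This final phase is standard K-means except for its seeding: instead of random initialization, it starts from the pre-computed centers v_1, …, v_{k_0}. A minimal sketch (the function name is ours):

```python
import numpy as np

def kmeans_with_centers(X, centers, max_iter=100):
    """K-means seeded with the centers found by pre-clustering."""
    centers = np.asarray(centers, dtype=float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assign each object to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # assignments stable -> converged
        labels = new_labels
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):           # keep old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers
```

Because the seeds already sit near the true cluster structure, the iteration typically converges in far fewer steps than randomly initialized K-means and avoids its unstable local optima.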