Discovering Movie Categories Based on SPHK-means Clustering Algorithm

Based on SPHK-means, an improved K-means clustering algorithm, we design an experiment using the dataset provided by MovieLens. First, we reduce the dimensionality of the movie-user rating matrix. Then, we repeatedly sample the movies and apply agglomerative hierarchical clustering to the samples in order to determine an appropriate value of k and the initial centers. Finally, with k and the initial centers fixed, we divide the movies into groups by K-means clustering. Using precision, recall and the number of groups found as evaluation indicators, the experiment shows that the SPHK-means clustering algorithm gives better results than the classical K-means clustering algorithm.


Introduction of datasets
MovieLens, a movie recommendation system, provides the MovieLens Latest Datasets for education and development. Our experiment uses the MovieLens Latest Datasets (small), updated on January 16th, 2016.
The ratings that 668 users gave to 10325 movies are used as the training dataset; a fraction of the training dataset is shown in Table 1. The style (genre) tags of the 10325 movies are used as the testing dataset; a fraction of the testing dataset is shown in Table 2.
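
As an illustration of how the training data can be turned into a movie-user rating matrix, the following is a minimal sketch using the 'csv' and 'numpy' modules. It assumes the `userId,movieId,rating,timestamp` column layout of the MovieLens `ratings.csv` file; the function name `load_rating_matrix`, the in-memory `SAMPLE` rows and the choice of 0 for "not rated" are assumptions made for this example, not details taken from the experiment.

```python
import csv
import io

import numpy as np

# Tiny stand-in for ratings.csv (same header as the MovieLens file).
# The rows themselves are made up for this example.
SAMPLE = """userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
2,16,3.0,859046895
3,32,5.0,1217896246
"""


def load_rating_matrix(handle):
    """Return (matrix, movie_ids, user_ids); rows are movies, columns are users."""
    rows = [(int(r["movieId"]), int(r["userId"]), float(r["rating"]))
            for r in csv.DictReader(handle)]
    movie_ids = sorted({m for m, _, _ in rows})
    user_ids = sorted({u for _, u, _ in rows})
    movie_index = {m: i for i, m in enumerate(movie_ids)}
    user_index = {u: j for j, u in enumerate(user_ids)}
    # Assumption: a missing rating is stored as 0.
    matrix = np.zeros((len(movie_ids), len(user_ids)))
    for m, u, rating in rows:
        matrix[movie_index[m], user_index[u]] = rating
    return matrix, movie_ids, user_ids


matrix, movie_ids, user_ids = load_rating_matrix(io.StringIO(SAMPLE))
print(matrix.shape)
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an opened `ratings.csv` handle, and the full dataset would give a 10325 x 668 matrix.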

Implementation of SPHK-means clustering algorithm
The experiment environment is Windows 10. We use Python 2.7 with the modules 'csv', 'numpy', 'scipy', 'math' and 'matplotlib'. The process of our experiment is shown in Figure 1. The major portions of the functional code are as follows.
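
As a rough illustration of the three stages (dimension reduction, repeated sampling with agglomerative hierarchical clustering, then K-means with fixed k and initial centers), the following is a minimal sketch and not the original code. Several choices are assumptions: truncated SVD for dimension reduction, average linkage cut at a distance threshold, taking the most frequent cluster count across samples as k, and all function names and default parameters. It is written for Python 3 with current NumPy and SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2


def reduce_dimensions(matrix, n_components):
    """Project rows onto the top singular directions (assumed: truncated SVD)."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def estimate_centers(points, n_samples, sample_size, distance_threshold, rng):
    """Hierarchically cluster several random samples and return the cluster
    means of a sample whose cluster count is the most frequent one."""
    candidates = []
    for _ in range(n_samples):
        idx = rng.choice(len(points), size=sample_size, replace=False)
        sample = points[idx]
        tree = linkage(sample, method="average")  # assumed linkage method
        labels = fcluster(tree, t=distance_threshold, criterion="distance")
        centers = np.array([sample[labels == c].mean(axis=0)
                            for c in np.unique(labels)])
        candidates.append(centers)
    counts = [len(c) for c in candidates]
    k = max(set(counts), key=counts.count)  # most frequent cluster count
    return next(c for c in candidates if len(c) == k)


def sphk_means(matrix, n_components=2, n_samples=5, sample_size=30,
               distance_threshold=3.0, seed=0):
    """Sketch of the pipeline: reduce, estimate k and centers, run K-means."""
    rng = np.random.default_rng(seed)
    points = reduce_dimensions(matrix, n_components)
    init = estimate_centers(points, n_samples, sample_size,
                            distance_threshold, rng)
    # minit="matrix" makes kmeans2 start from the given centers.
    centers, labels = kmeans2(points, init, minit="matrix")
    return centers, labels


# Synthetic stand-in data: three well-separated groups of 40 rows each.
demo_rng = np.random.default_rng(1)
blobs = np.vstack([demo_rng.normal(loc, 0.3, size=(40, 6))
                   for loc in (0.0, 5.0, 10.0)])
centers, labels = sphk_means(blobs)
print(len(centers))
```

The point of the design is that K-means no longer starts from an arbitrary k and random centers: both come from the hierarchical clustering of the samples, which is cheap because each sample is much smaller than the full dataset.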

Evaluation indexes
In statistics, precision, recall and F-score are typical evaluation indexes for classification problems. If the result of a classification is as shown in Figure 2 [3], precision and recall are defined by formula (1) and formula (2).
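
For reference, the standard definitions can be computed as in the short sketch below: precision is TP / (TP + FP), recall is TP / (TP + FN), and the F-score (here F1) is their harmonic mean, where TP, FP and FN are the counts of true positives, false positives and false negatives. The function name and the example counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 correct assignments, 10 wrong ones, 20 missed.
p, r, f = precision_recall_f1(tp=30, fp=10, fn=20)
print(p, r, round(f, 3))
```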

IST2017
By comparison, Table 3 shows that the precision and recall of the SPHK-means clustering algorithm are both higher than those of the K-means clustering algorithm. In the test dataset, movies are actually divided into 19 categories: Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, IMAX, Musical, Mystery, Romance, Sci-Fi, Thriller, War and Western. Table 4 shows that the SPHK-means clustering algorithm finds considerably more movie categories than the classical K-means clustering algorithm.
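
One possible way to count the categories found by a clustering is sketched below. It assumes that each cluster is labelled with the most frequent genre among its members and that genres are stored as pipe-separated strings, as in the MovieLens `movies.csv` file. This labelling rule, the function name and the sample data are assumptions for illustration; the rule actually used in the experiment may differ.

```python
from collections import Counter


def categories_found(cluster_labels, movie_genres):
    """Label each cluster with its most frequent genre; return the set of labels."""
    per_cluster = {}
    for label, genres in zip(cluster_labels, movie_genres):
        per_cluster.setdefault(label, Counter()).update(genres.split("|"))
    return {counts.most_common(1)[0][0] for counts in per_cluster.values()}


# Hypothetical cluster assignments and genre strings for five movies.
labels = [0, 0, 1, 1, 2]
genres = ["Comedy|Romance", "Comedy", "Horror|Thriller", "Horror", "Drama"]
found = categories_found(labels, genres)
print(len(found))
```

Under this rule, two clusters dominated by the same genre count as a single category, so the number of categories found can be smaller than k.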

Conclusion
In this paper, we have verified experimentally that the SPHK-means clustering algorithm is better than the classical K-means clustering algorithm in terms of classification accuracy and the number of categories found.
Next, in order to adapt the SPHK-means clustering algorithm to more scenarios, we will consider the situation in which one object may belong to two or more categories, which requires us to integrate the membership degree [4] from fuzzy mathematics into the algorithm.

Figure 1. The process of our experiment.

Table 1. A fraction of the training dataset.

Table 2. A fraction of the testing dataset.