Improving the Classification Quality of the SVM Classifier for Imbalanced Datasets Based on the Ideas of the SMOTE Algorithm

Abstract. An approach to the classification of imbalanced datasets is considered. The aim of this research is to determine the effectiveness of the SMOTE algorithm when it is necessary to improve the classification quality of the SVM classifier applied to imbalanced datasets. Experimental results are given which demonstrate the improvement of the SVM classifier quality achieved by applying the ideas of the SMOTE algorithm to imbalanced datasets in the field of medical diagnostics.


Introduction
The problem of imbalanced data is one of the main problems that must be solved before applying a machine learning classification algorithm if we want to obtain high-quality classification results.
A dataset is called imbalanced if the number of samples from one class is much smaller or larger than the number of samples from the other classes.
Training classifiers on imbalanced datasets compromises the performance of most well-known machine learning algorithms. This is true, in particular, for the support vector machine algorithm (SVM, Support Vector Machine) [1-9].
The incorrect classification of an object of the minority class usually costs significantly more than the incorrect classification of an object of the majority class, since the minority class instances are rare but represent the most important data in real datasets.
As experimental studies show, training classifiers on imbalanced datasets leads to classifiers that tend to assign all objects to the majority class, completely ignoring the underrepresented minority class, which generally does not correspond to the actual purpose of the research [3-6].
A significant number of real-world applications suffer from the class imbalance problem (for example, medical and fault diagnostics, anomaly detection, face recognition, telecommunications, web and email classification, ecology, biology, and financial services). For example, in medical diagnostics the number of sick patients is usually significantly smaller than the number of healthy people.
Currently, different sampling strategies are applied to solve the problem of imbalanced datasets. In this paper, the applicability of sampling strategies for restoring the balance between the classes in the binary classification problem is studied. In particular, the capabilities of the synthetic sampling algorithm known as SMOTE (Synthetic Minority Oversampling Technique) [3] are investigated.

The basic principles of the SVM classifier development
The SVM algorithm proposed by Vapnik [1, 4] is a modern approach to solving pattern recognition problems. The SVM algorithm maps the sample points into a high-dimensional feature space and seeks an optimal separating hyperplane by maximizing the margin between two classes.
The Support Vector Machine (SVM) algorithm is a supervised machine learning algorithm [1-9]. It is successfully used for different classification problems in various applications [8]. Classifiers based on the SVM algorithm have been applied to credit risk analysis, medical diagnostics, text categorization, information extraction, etc. [8].
To develop the best SVM classifier it is necessary to correctly choose the kernel function type, the values of the kernel function parameters and the value of the regularization parameter [8]. This can be achieved by a grid search over the kernel function types, the kernel function parameter values and the regularization parameter value, which demands significant computational expense. The quality of the SVM classifier can be measured by different classification quality indicators: the cross-validation indicator, the accuracy indicator, the classification completeness (recall) indicator, indicators based on ROC curve analysis, etc. [8].
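Such a grid search can be sketched as follows. Here `fake_score` is a hypothetical score surface standing in for the expensive train-and-evaluate step of a real SVM classifier; the parameter names `C` and `gamma` and the grid values are illustrative assumptions.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustively score every parameter combination and keep the best."""
    best_params, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical score surface standing in for cross-validated SVM quality,
# peaking at C = 10, gamma = 0.1.
def fake_score(params):
    return -abs(params["C"] - 10) - abs(params["gamma"] - 0.1)

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best, score = grid_search(grid, fake_score)
```

The number of evaluated combinations grows multiplicatively with each parameter, which is exactly the computational expense that motivates the PSO-based search discussed below.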
The SVM classifier with satisfactory training and testing results can be used to classify new objects.
The separating hyperplane for the objects from the training set can be represented by the equation <w, z> - b = 0, where w is a vector perpendicular to the separating hyperplane; b is a parameter that corresponds to the shortest distance from the origin of coordinates to the hyperplane; <w, z> is the scalar product of the vectors w and z.
The condition -1 <= <w, z> - b <= 1 specifies a strip that separates the classes. The wider the strip, the more confidently we can classify objects. The objects closest to the separating hyperplane lie exactly on the boundaries of the strip.
Finding the separating hyperplane reduces to the dual problem of searching for a saddle point of the Lagrange function, which in turn reduces to a quadratic programming problem containing only the dual variables [8].
When training the SVM classifier it is necessary to determine the kernel function type K(z, z'), the values of the kernel parameters and the value of the regularization parameter C, which allows finding a compromise between maximizing the gap separating the classes and minimizing the total error [8]. One of the approaches used to search for the optimal values of the SVM classifier parameters is based on the application of the Particle Swarm Optimization algorithm (PSO algorithm) [7-9].
The search space of the PSO algorithm is filled with a population of particles, each of which has some location and velocity in the space of the problem parameters at a concrete moment of time.
The corresponding value of the objective function is calculated for each particle location. The particle location and velocity are changed after the calculation of a new value of the objective function.
At every iteration, when the next particle location is determined, information on the best particle among a number of neighboring particles (particles can share information), as well as information on the location of the given particle at the iteration when its best objective function value was obtained (particles have "memory"), is taken into account [7].
In this research we used the canonical version of the PSO algorithm [8].
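A minimal sketch of the canonical PSO scheme described above, minimizing a simple quadratic test objective. The inertia and acceleration coefficients w, c1, c2 and the test function are illustrative assumptions, not the settings used in this research.

```python
import random

def pso(objective, bounds, n_particles=20, iters=60, w=0.7, c1=1.5, c2=1.5):
    """Canonical PSO minimizing `objective` over the box `bounds`."""
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                     # personal best locations ("memory")
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]    # swarm best ("shared information")
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

random.seed(0)
# Toy objective with minimum at (3, -1); for SVM tuning the objective would
# instead be a classification quality indicator as a function of (C, sigma).
best, val = pso(lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2, [(-10, 10), (-10, 10)])
```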

The basic ideas of the SMOTE algorithm
A dataset is imbalanced if the classes are not approximately equally represented.
The SMOTE algorithm [3] creates artificial objects of the minority class based on the similarities in the feature space between the existing objects, using the k-nearest neighbors algorithm (kNN algorithm) (Fig. 1) [10]. In this way, a number of artificial objects are generated which are "similar" to the existing minority class objects but do not duplicate them.
Thus, the SMOTE technique is an important approach based on oversampling the minority class.
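The core idea of SMOTE, stepping a random fraction of the way from a minority object towards one of its k nearest minority neighbors, can be sketched as follows. This is a simplified illustration, not the library implementation used in the experiments.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority points by interpolating between a minority
    point and one of its k nearest minority neighbours (core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself).
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment between x and nb
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Hypothetical minority cluster inside the unit square.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote_sketch(cluster, n_synthetic=10)
# Every synthetic point lies on a segment between two existing minority
# points, so it is "similar" to them without duplicating either.
```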

The experimental study
In this research, we implemented the following approach to the application of SMOTE for improving the classification quality of the SVM classifier on imbalanced datasets.
Step 1. The initial dataset is split into the train and test datasets.
Step 2. New synthetic objects are generated by the SMOTE algorithm for each dataset obtained in step 1.
Step 3. The SVM algorithm is applied to the newly obtained datasets. Here the search for the optimal parameter values of the SVM classifier is performed by the PSO algorithm (in particular, the PSO algorithm is used to search for two optimal parameter values of the SVM classifier with the radial basis kernel function: the value of the regularization parameter C and the value of the kernel function parameter σ) [8].
Python 2.7 was used for the software implementation of the SMOTE and SVM algorithms. The default settings were applied for the SMOTE algorithm.
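The balancing step can be illustrated schematically. In this sketch `oversample` uses naive random duplication as a stand-in for SMOTE (which would interpolate new points instead of copying), and the toy dataset is hypothetical.

```python
import random

def split(data, test_frac=0.3, seed=42):
    """Step 1 stand-in: shuffle and split the labelled data."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def oversample(dataset, minority_label, seed=0):
    """Step 2 stand-in: duplicate random minority samples until the classes
    are balanced (SMOTE would generate interpolated points instead)."""
    rng = random.Random(seed)
    minority = [s for s in dataset if s[1] == minority_label]
    majority = [s for s in dataset if s[1] != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return dataset + extra

# Hypothetical toy dataset: 3 minority (label 0) and 9 majority (label 1) samples.
toy = [((0.0, 0.0), 0)] * 3 + [((1.0, 1.0), 1)] * 9
balanced = oversample(toy, minority_label=0)
# After balancing, both classes contain 9 samples; step 3 would then train
# the SVM classifier, tuned by the PSO algorithm, on the balanced data.
```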
We use real medical datasets from the UCI machine learning repository [4] to demonstrate the classification performance of the approach proposed in this paper. These free medical datasets are "Heart", "Hepatitis" and "Pima diabetes". In each of them, the positive class consists of the data corresponding to the healthy, normal, or benign cases, while the negative class contains the data for the diseased, abnormal, or malignant cases. Further details of these datasets are provided in Table 1.
In Table 1 the value of the imbalance ratio (Ratio) was calculated by the formula Ratio = a1 / a2, where a1 is the number of objects in the minority class and a2 is the number of objects in the majority class. The results of the application of the SVM-PSO algorithm and the SMOTE-SVM-PSO algorithm are shown in Table 2, Table 3 and Table 4 (with the following designation of the algorithms: 1 is the SVM-PSO algorithm, 2 is the SMOTE-SVM-PSO algorithm).
For the "Heart" dataset the class imbalance is not strongly pronounced; therefore, the results of the SVM-PSO algorithm do not differ much from those of the SMOTE-SVM-PSO algorithm. Table 2 shows the results for the "Heart" dataset for two cases depending on the choice of the datasets for training and testing.
We can say that the effectiveness of the SVM classifier can indeed be improved when the structure of the data is taken into consideration.
For the other datasets the class imbalance is considerable; therefore, the SVM-PSO algorithm is inferior to the SMOTE-SVM-PSO algorithm, as the results of the SVM-PSO algorithm are characterized by low values of accuracy, specificity and sensitivity. These results confirm that the standard SVM algorithm is sensitive to the class imbalance problem.
The obtained results correspond to the implementation of the SMOTE algorithm with the default parameter values used in the Python library. In addition, we suggest a search algorithm for the optimal parameter values of the SMOTE algorithm. In particular, we considered two parameters: the number k of nearest neighbors used to construct the synthetic samples, and the number m of nearest neighbors used to determine whether a minority sample is "in danger".
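The role of the parameter m can be illustrated by a danger check in the style of Borderline-SMOTE. This sketch assumes the common convention that a minority sample is "in danger" (borderline) when at least half but not all of its m nearest neighbors belong to the majority class; the points below are hypothetical.

```python
import math

def in_danger(sample, minority, majority, m=5):
    """Return True if the minority `sample` is borderline: at least half but
    not all of its m nearest neighbours belong to the majority class."""
    labelled = ([(math.dist(sample, p), 0) for p in minority if p != sample] +
                [(math.dist(sample, p), 1) for p in majority])
    nearest = sorted(labelled)[:m]          # m nearest neighbours, any class
    n_majority = sum(label for _, label in nearest)
    return m / 2 <= n_majority < m          # n_majority == m would be noise

# Hypothetical points: (0, 0) sits on the class border,
# since 2 of its 3 nearest neighbours are majority objects.
minority = [(0.0, 0.0), (0.1, 0.0)]
majority = [(0.2, 0.0), (0.3, 0.0), (5.0, 0.0), (6.0, 0.0)]
border = in_danger((0.0, 0.0), minority, majority, m=3)
```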
The suggested search algorithm can be described by the following sequence of steps.
Step 1. To form the set of candidate pairs (k_i, m_j) of the SMOTE parameter values.
Step 2. To build n SVM classifiers for each pair (k_i, m_j) using the SMOTE algorithm for the imbalanced data (that is, to apply the SMOTE algorithm for each pair (k_i, m_j) with equal probability).
Step 3. To evaluate the classification quality of the developed SVM classifiers and to save the obtained SVM classifiers. If the maximum number of iterations is reached, to find the best SVM classifier and finish the algorithm; otherwise, to go to step 4.
Step 4. To estimate the average classification quality of the SVM classifiers, using, for example, the F-measure indicator, for each pair (k_i, m_j):
• to find the total sum S_ij^g of the classification quality values of the SVM classifiers for the pair (k_i, m_j) obtained up to the current iteration number g of the suggested algorithm;
• to find the ratio S_ij^g / g for each pair (k_i, m_j).
It is necessary to note that we generate different balanced datasets using a random number generator for each pair (k_i, m_j); therefore, the developed SVM classifiers differ from each other. The offered algorithm allows minimizing the time expenditure for the search for the optimal parameter values of the SMOTE algorithm and, hence, for the development of the SVM classifier.
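The averaging scheme over the parameter pairs (k, m) can be sketched as follows. Here `noisy_f_measure` is a hypothetical stand-in for the real chain "rebalance with SMOTE(k, m), train the SVM classifier, compute the F-measure", and the peak location of the quality surface is an illustrative assumption.

```python
import random

def search_smote_params(pairs, quality_fn, iters=30, seed=1):
    """Rank SMOTE parameter pairs (k, m) by the mean classification quality
    S_ij / g accumulated over g randomly rebalanced training runs."""
    rng = random.Random(seed)
    totals = {pair: 0.0 for pair in pairs}  # running sums S_ij
    for g in range(1, iters + 1):
        for pair in pairs:
            # quality_fn stands in for: SMOTE-rebalance -> train SVM -> F-measure.
            totals[pair] += quality_fn(pair, rng)
    means = {pair: totals[pair] / iters for pair in pairs}
    return max(means, key=means.get), means

# Hypothetical noisy quality surface peaking at k = 5, m = 10; the noise term
# mimics the run-to-run variation caused by the random rebalancing.
def noisy_f_measure(pair, rng):
    k, m = pair
    return 1.0 - 0.05 * (abs(k - 5) + abs(m - 10)) + rng.uniform(-0.02, 0.02)

pairs = [(k, m) for k in (3, 5, 7) for m in (5, 10, 15)]
best, means = search_smote_params(pairs, noisy_f_measure)
```

Averaging over many random rebalancings smooths out the noise of any single run, which is why the ratio S_ij^g / g rather than a single quality value is used to compare the pairs.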

Conclusion
The experimental results show that the SMOTE algorithm improves the classification quality of SVM classifiers for imbalanced data. For datasets with high imbalance, the effectiveness of the SMOTE algorithm is especially high.
The offered algorithm allows minimizing the time expenditure for the search for the optimal parameter values of the SMOTE algorithm.
Further, we plan to consider the values of several classification quality indicators simultaneously when choosing the optimal parameter values of the SMOTE algorithm.

Table 1.
The characteristics of the datasets.

Table 2.
The classification results on the "Heart" dataset.

Table 3.
The classification results on the "Hepatitis" dataset.

Table 4.
The classification results on the "Pima diabetes" dataset.