Feature Selection with a Backtracking Search Optimization Algorithm

Abstract—Feature selection significantly affects the outcome of any classification or regression task. Applying evolutionary computation algorithms to feature selection has led to the construction of efficient discrete optimization algorithms. In this paper, a modified backtracking search algorithm is employed to perform wrapper-based feature selection, where two modifications of the standard backtracking search algorithm are adopted. The first utilizes a ranking of the particles in the current population; the second removes the single-element case from the crossover operation. The resulting algorithm is then applied to feature selection through two general frameworks originally developed for particle swarm optimization, the first based on binary and the second on set-based particle swarm optimization. The experimental analysis shows that these variants of the backtracking search algorithm perform equally well on the classification of several datasets.


I. INTRODUCTION
Feature selection (FS) is defined as the problem of choosing an optimal subset of features for use in a classification or regression model. It consists of two main components, namely a search technique for proposing new feature subsets and an evaluation function for scoring them.
Evolutionary algorithms have been used extensively to search through the space of possible features. These algorithms are based on three key concepts: particles, which represent the candidate solutions to the problem; positions, the values of the particles at each iteration; and velocities, the directions along which particle positions change. In the context of FS, particles represent different feature combinations.
Recently, Civicioglu proposed a new continuous optimization method, called Backtracking Search Optimization (BSA) [18]. According to the author, the algorithm outperformed several popular evolutionary algorithms, including PSO, on a wide range of tests. In [19], Tsekouras et al. proposed a modified BSA algorithm (MBSA) and successfully used it to train a neuro-fuzzy network for modelling shoreline realignment.
In this study, we investigate the feasibility of employing BSA for combinatorial optimization within the context of feature selection. As a starting point, we develop two variations of the novel binary PSO algorithm (NBPSO) in [20], and the set-based PSO (SBPSO) in [21], where PSO is replaced with MBSA. The new algorithms are compared against their PSO counterparts on feature selection for the classification of datasets from the UCI machine learning repository [22].
The rest of the paper is organized as follows. In Section II, the PSO, BPSO, NBPSO and SBPSO algorithms are briefly summarized; then BSA and MBSA are reviewed and the proposed binary and set-based versions of discrete MBSA are described. In Section III, the performance of the two algorithms is discussed. Conclusions and future work are presented in the last section.

II. ALGORITHMS

A. Particle Swarm Optimization
The basic variant of PSO [1] starts with a set of randomly generated particle position-velocity pairs. At each iteration the current position, the best known position of each particle i (Ppi) and the best known position among all particles (Pg) are combined into a new velocity-position pair. Algorithm 1 lists the pseudo code for the basic PSO algorithm within the context of minimization. Four parameters, namely ω, ϕp, ϕg and lr, are used to tune the performance of PSO. The first three control how the current position and velocity, and the personal and global best positions, affect the new velocity, while the learning rate lr scales the effect of the new velocity on the new position.

    /* Algorithm 1: basic PSO (random positions and velocities are initialized in lines 1-7) */
    8          Ppi ← Pi;
    9          if Cost(Pi) < Cost(Pg) then Pg ← Pi;
    10     end for
           /* Loop until certain termination criteria are met */
    11     while Convergence == false do
    12         for i = 1 to N particles do
    13             rp, rg ← U(0, 1);
    14             Vi ← ωVi + ϕp rp ⊙ (Ppi − Pi) + ϕg rg ⊙ (Pg − Pi);
    15             Pi ← Pi + lr Vi;
    16             if Cost(Pi) < Cost(Ppi) then
    17                 Ppi ← Pi;
    18                 if Cost(Pi) < Cost(Pg) then Pg ← Pi;

The first binary PSO algorithm (BPSO) was proposed by Eberhart et al. in [23]. The main idea is to retain the continuous nature of the velocity and introduce the sigmoid function S(v) = 1/(1 + e^(−v)) as a means to associate continuous velocities with discrete positions. Then, for all features j, line 15 in algorithm 1 becomes

    Pi[j] ← U(0, 1) < S(Vi[j]) ? 1 : 0,

where the conditional operator from the C programming language is used to denote the if-else statement, according to the syntax condition ? value-if-true : value-if-false. Intuitively, in BPSO the velocity can be regarded as the probability that a particular position element is 1 or 0. However, such an interpretation is problematic ([8], [20]). In continuous PSO, velocities direct the current particle position towards the optimum; large absolute values indicate that a big change is required, while a zero velocity is indicative of convergence. In BPSO, large velocity values direct, with high probability, position bits to the values 0 or 1, thus hindering exploration.
On the other hand, a zero velocity causes completely random bit assignments to the new position, which adversely affects exploitation.
A binary PSO algorithm attempting to overcome the drawbacks of the original BPSO was presented in [20]. The proposed algorithm, called Novel Binary PSO (NBPSO), combines the probabilistic interpretation of velocity from BPSO with the physical interpretation as rate of change from the continuous PSO. This is achieved by introducing two intermediate probability 'velocity' vectors, V 1 i and V 0 i , for each particle i, which express the probability that a bit should be flipped (to 1 and 0 respectively).
At each iteration, both vectors are updated according to the intuitive rule that, for a given bit value b (0 or 1) of the jth bit of a best position (personal or global), the probability of setting the corresponding bit of the new position to b should increase, while the probability of setting it to the complement value b̄ should decrease. Specifically,

    V1_i[j] ← ωV1_i[j] + d1_ij,p + d1_ij,g,    V0_i[j] ← ωV0_i[j] + d0_ij,p + d0_ij,g,

where

    d1_ij,w = c_w r_w and d0_ij,w = −c_w r_w, if P_w[j] = 1,
    d1_ij,w = −c_w r_w and d0_ij,w = c_w r_w, if P_w[j] = 0.

In the above, the subscript w takes the values p and g and denotes the personal best or the global best position, c_w are positive user-selected parameters and r_w are random numbers in [0, 1]. Finally, the new value for every bit is calculated by applying the equation

    P_i[j] ← U(0, 1) < S(Vc_i[j]) ? P̄_i[j] : P_i[j],

where the chosen velocity Vc_i[j] of the jth bit is the one calculated for the complement of its current value, i.e., Vc_i[j] = V1_i[j] if P_i[j] = 0 and Vc_i[j] = V0_i[j] if P_i[j] = 1.

The set-based algorithm in [21] (SBPSO) takes a different approach to combinatorial optimization with PSO. Here, the position P_i of particle i is a member of the power set P(D), where D is the domain of discourse, while the velocity V_i is defined as a set of operations, namely element additions and removals, which lead to a new position. Interactions between positions and velocities are directed by the following six operators:
1) Addition of two velocities, V1 ⊕ V2, is defined as the union of the two operation sets and yields a new velocity.
2) Difference of two positions, P1 ⊖ P2, is defined as the set of operations which convert P2 to P1, i.e., addition of the elements of P1 not in P2 and removal of the elements of P2 not in P1, and yields a new velocity.
3) Multiplication of a scalar β with a velocity, β ⊗ V, randomly retains part of the operation set and yields a new velocity.
4) Addition of a position and a velocity, P ⊞ V, applies the operations of V to P and yields a new position.
5) Removal of elements from a position, β ⊙− S, marks N_β,S random elements of the set S for removal, where N_β,S = ⌊β⌋ + (U(0, 1) < β − ⌊β⌋ ? 1 : 0).
6) Addition of elements outside the union S_U = P ∪ P_p ∪ P_g to the position P, β ⊙+k S_U, marks N_β,S_U elements for addition, with N_β,S_U given by the formula of the previous operator. In order to select each new element, a random element from outside S_U is added to the position and the objective function is evaluated. This is repeated k times, and the element which achieved the best score is marked for addition. The whole process is repeated until all N_β,S_U elements are selected (k-tournament).

With these six operators defined, the continuous PSO can be readily applied to combinatorial optimization. In particular, the velocity and position update equations (lines 14 and 15 in algorithm 1) become

    V_i ← ϕ_p r_p ⊗ (P_pi ⊖ P_i) ⊕ ϕ_g r_g ⊗ (P_g ⊖ P_i) ⊕ β ⊙− S ⊕ β ⊙+k S_U    and    P_i ← P_i ⊞ V_i,

where S = P_i ∩ P_pi ∩ P_g and S_U = P_i ∪ P_pi ∪ P_g.
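The set arithmetic behind these operators becomes very compact if positions are encoded as bitmask sets over the feature universe, a representation choice we assume here for illustration. A minimal C sketch of the purely deterministic operators (velocity addition, position difference and position-velocity addition):

```c
/* Positions are encoded as bitmask sets over the feature universe;
 * a velocity is a pair of masks: elements to add and elements to remove. */
typedef struct {
    unsigned add;
    unsigned remove;
} velocity;

/* Addition of two velocities: the union of the two operation sets. */
velocity vel_add(velocity a, velocity b)
{
    velocity v;
    v.add = a.add | b.add;
    v.remove = a.remove | b.remove;
    return v;
}

/* Difference of two positions: the operations turning p2 into p1. */
velocity pos_diff(unsigned p1, unsigned p2)
{
    velocity v;
    v.add = p1 & ~p2;    /* elements of p1 not in p2 */
    v.remove = p2 & ~p1; /* elements of p2 not in p1 */
    return v;
}

/* Addition of a position and a velocity: apply the operations to p. */
unsigned pos_apply(unsigned p, velocity v)
{
    return (p | v.add) & ~v.remove;
}
```

By construction, applying pos_diff(p1, p2) to p2 recovers p1, which is exactly the defining property of the position difference.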

B. Backtracking Search Optimization
Backtracking Search Optimization (BSA) [18] employs a historic population, i.e., a set of previous positions, to guide the evolution of an initial random population. At each iteration a previous position, which is randomly assigned to each particle, is used to calculate the particle's velocity vector as its difference from the particle's current position. Then, similarly to PSO, a new candidate position (Mutant) is obtained by combining the current position with the calculated velocity. If any mutated elements lie outside the boundaries of allowed values, then they are reset to a random value within this range.
Contrary to PSO, the candidate positions are not directly evaluated. Instead, they are further modified by a crossover phase, which aims to increase the diversity of the new trial population. In particular, the mutation of a set of randomly selected elements in the candidate position vector is discarded and those elements remain unchanged. A coin flip decides whether the mutation will be accepted for only one element or for a larger, random number of elements. A second coin flip, at the beginning of each iteration, decides whether to retain the previous historic population or to refresh it by replacing it with the current population. BSA's behaviour is affected by two main parameters, namely the learning rate lr, which controls the effect of the velocity on the new candidate position, and the mixrate mr, which influences the amount of mixing between the mutant and the current position. Algorithm 2 lists the pseudo code of BSA.
In [19], Tsekouras et al. argue that the use of the historic population and the lack of a strategy to increase the population diversity in the mutation phase might cause the standard BSA to have an inferior balance between exploitation and exploration, thus exhibiting poor convergence characteristics, such as slow or premature convergence. As a remedy they propose three modifications. The first modification pertains to the mutation operation and utilizes the ranking of the particles in the current population. In particular, a selection probability p_ri, increasing with the rank of the particle, is assigned to each current position P_i. A random number, representing a probability threshold, is generated and an individual is randomly picked out of the subset of particles with probabilities higher than the threshold. This step is repeated until a new rank population P_r with N particles is formed. For each particle a second velocity vector is calculated as the difference of the current position from the corresponding position in the P_r population. The mutation formula (line 10 in algorithm 2) is modified accordingly to include this second velocity, namely

    Mutant_i ← P_i + F (P_hi − P_i) + β (P_ri − P_i),

where P_hi is the historic position assigned to particle i and β ∈ U(−1, 1). The second proposed modification applies to the crossover operation, where the case of mutating a single element is removed, so that the number of mutated elements is always drawn from the mixrate-controlled branch of the m0 assignment in line 18 of algorithm 2. Finally, the boundary control of the standard algorithm is also changed. When a mutated position element is out of bounds, then instead of being randomly placed within the range of allowed values, it is set to the violated boundary, i.e., line 13 of algorithm 2 becomes

    Mutant_i[j] ← max(low_j, min(Mutant_i[j], up_j)).
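The rank-based construction of P_r can be sketched in C as follows. The linear ranking p_ri = (N − i)/N over the cost-sorted population is our assumption for illustration; [19] does not necessarily use this exact formula.

```c
#include <stdlib.h>

/* Build the rank population: each slot repeatedly draws a random
 * probability threshold and then picks one particle, uniformly, among
 * those whose rank-based probability exceeds the threshold, so that
 * better-ranked particles are selected more often. The linear ranking
 * p_ri = (n - i) / n over the cost-sorted population (sorted_idx, best
 * first) is an assumption for illustration. */
void rank_population(const int *sorted_idx, int n, int *rank_idx)
{
    int *eligible = malloc((size_t)n * sizeof *eligible);
    int filled = 0;

    while (filled < n) {
        double thr = (double)rand() / ((double)RAND_MAX + 1.0);
        int m = 0, i;
        for (i = 0; i < n; i++)
            if ((double)(n - i) / (double)n > thr) /* p_ri of i-th best */
                eligible[m++] = sorted_idx[i];
        if (m > 0) /* the best particle is always eligible, so m >= 1 */
            rank_idx[filled++] = eligible[rand() % m];
    }
    free(eligible);
}
```

Since the best-ranked particle has probability 1, it is eligible for every threshold, which guarantees termination of the loop.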

C. Binary and Set-based Backtracking Search Optimization
The fact that MBSA and PSO share similar update equations (equation 10 and lines 14, 15 in algorithm 1), allows for the construction of binary and set-based variations of NBPSO and SBPSO respectively, by simply replacing the PSO algorithm with MBSA.
Indeed, both updates yield a new position by combining a pair of velocities, each defined as the distance of the current position from a previous one. In both algorithms, the previous positions capture good feature combinations, which influence the particle evolution. In the case of PSO, the two velocities utilize the personal best and global best positions. In the case of MBSA, the occasional refreshing of the historic population, by replacing it with the current population (line 6 in algorithm 2), implies that the historic population gradually improves. In addition, the rank population contains the best particles from the current population.
Consequently, the incorporation of MBSA into NBPSO and SBPSO is straightforward. Specifically, the PSO personal best and global best positions in equations 2 and 7 are replaced by the MBSA positions from the historic and the rank populations, respectively. Henceforth, we refer to the new variations as NBBSA and SBBSA.

III. EXPERIMENTAL EVALUATION
We compare the new algorithms against their PSO counterparts on feature selection for a series of classification problems from the UCI repository. The underlying classifier is a simple two-layer neural network, namely a full-rank linear layer with leaky ReLU activation, followed by a second full-rank linear layer with sigmoid or softmax activation for binary and multi-class classification, respectively. Accordingly, cross entropy is used as the loss function.
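The wrapper objective evaluated by all four algorithms can be sketched as follows. Here train_and_score and the toy scorer are hypothetical stand-ins for the network training and evaluation described above.

```c
#include <stdlib.h>

/* Wrapper-based FS objective: keep only the columns where mask[j] == 1,
 * hand the reduced data (row-major) to the classifier and return its
 * accuracy. train_and_score is a hypothetical callback standing in for
 * the network training and cross-validation described in the text. */
double wrapper_objective(const int *mask, int n_features,
                         const double *X, const int *y, int n_samples,
                         double (*train_and_score)(const double *, const int *,
                                                   int, int))
{
    int j, s, n_kept = 0;
    double *Xs, score;

    for (j = 0; j < n_features; j++)
        if (mask[j])
            n_kept++;
    if (n_kept == 0)
        return 0.0; /* empty feature subsets score worst */

    Xs = malloc((size_t)n_samples * (size_t)n_kept * sizeof *Xs);
    for (s = 0; s < n_samples; s++) {
        int c = 0;
        for (j = 0; j < n_features; j++)
            if (mask[j])
                Xs[s * n_kept + c++] = X[s * n_features + j];
    }
    score = train_and_score(Xs, y, n_samples, n_kept);
    free(Xs);
    return score;
}

/* Toy stand-in scorer (illustration only): rewards larger subsets. */
double toy_score(const double *Xs, const int *y, int n_samples, int n_kept)
{
    (void)Xs; (void)y; (void)n_samples;
    return (double)n_kept / 10.0;
}
```

Each candidate particle, whether a bit vector (NBBSA) or a set (SBBSA), maps to such a mask, and the returned accuracy is the cost driving the evolutionary search.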
Initially, all datasets are inspected and columns with little or no information (e.g., ids, large number of missing values) are discarded. Afterwards, a reference run is performed, where the network is trained using all features. The performance of the network is evaluated with either a test set, if it is available, or 10-fold cross validation. Tables I, II and III list information on the datasets, the size and network configuration and the training setup, which were used for the reference run and gave the best reference performance.
For feature selection, the accuracy of the classification is used as the objective function of the optimization. In all cases, the evolving population consists of 16 individuals and the number of epochs was set to 30 (except for Bands with 100 epochs and Autism with 10). All parameters were set to 1.0 and no tuning was performed. Additionally, for the set-based versions of the discrete PSO and BSA algorithms, the k-tournament was not performed, as it was found to excessively increase the runtime of the feature selection process. Instead, N_β,S_U elements out of S_U were randomly picked. All runs were carried out on an Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz with 16 threads. The code was written in C and parallel processing was implemented using the OpenMP API [24].

Table V lists the results of the experiments. All four algorithms were successful in both reducing the number of features and improving the classification accuracy with respect to the reference run, even for the datasets with a small number of features and/or a high reference accuracy. They all performed comparably in terms of feature reduction, accuracy improvement and total runtime. Set-based algorithms appear to have a slight edge, as they achieved somewhat higher feature reductions with accuracy similar to their binary counterparts. Similarly, PSO variants tend to select fewer features, while BSA versions tend to achieve slightly higher accuracy.

IV. CONCLUSION
In this paper, two algorithms were presented, extending the BSA continuous optimization algorithm to combinatorial optimization, and were applied to the feature selection problem. Specifically, the similarity of the update equations of PSO and MBSA allowed for the straightforward incorporation of MBSA into two PSO-based discrete optimization algorithms, namely NBPSO and SBPSO, by, effectively, replacing PSO with MBSA. The performance of the two new BSA-based variations was tested against their PSO counterparts on feature selection for the classification of several datasets from the UCI repository. Although our analysis is by no means exhaustive, the test results indicate that BSA can be a viable alternative to PSO for feature selection and discrete optimization in general.
In the future, we plan to dig deeper into the performance of the presented algorithms, both by tuning their internal parameters and by trying them on more datasets, classifiers, and problems in general (e.g., regression). Furthermore, the popularity of PSO in discrete optimization offers a large number of algorithms where BSA can be integrated.