Spectral proximal method for solving large scale sparse optimization

In this paper, we propose a spectral proximal method for solving sparse optimization problems. Sparse optimization refers to an optimization problem involving the ℓ0-norm in the objective or constraints. Previous research showed that the spectral gradient method outperforms other standard unconstrained optimization methods, because it replaces the full rank approximation of the Hessian with a diagonal matrix, reducing the storage requirement from O(n²) to O(n). Since the ℓ0-norm term is nonconvex and non-smooth, the problem cannot be solved by standard optimization algorithms. We consider the ℓ0-norm minimization problem with an underdetermined linear system as its constraint. Using the Lagrange method, this problem is transformed into an unconstrained optimization problem. A new method called the spectral proximal method, a combination of the proximal method and the spectral gradient method, is proposed and applied to the resulting ℓ0-norm unconstrained optimization problem. The algorithms are implemented in Python to compare the efficiency of the proposed method with some existing methods. The benchmarks of the comparison are the number of iterations, the number of function calls and the computational time. Theoretically, the proposed method requires less storage and less computational time.


Introduction
There has been increased interest in the general field of sparsity in the past few years [1,2,3]. Finding sparse solutions to underdetermined linear systems has become a popular research topic in the areas of compressive sensing, image processing, machine learning and statistics [4,5]. Sparse optimization is an optimization problem which involves the zero-norm in the objective or constraints. The zero-norm of x, denoted the ℓ0-norm or ‖x‖₀, is the number of nonzero elements in x. The ℓ0-norm plays a crucial role in modelling the sparsity of data and selecting representative variables in optimization problems. Mathematically, some authors do not regard the ℓ0-norm as a proper norm because it fails the absolute homogeneity property: for all x ∈ ℝⁿ and λ ≠ 0, one has ‖λx‖₀ = ‖x‖₀ [5]. However, in this paper, we will adopt it as a norm [6]. The sparsest solution can be found by applying sparse optimization, which involves minimizing the ℓ0-norm.
*Corresponding author: gillianwoo@1utar.my
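The scale invariance that disqualifies the ℓ0-norm as a proper norm is easy to verify numerically. The following sketch counts nonzero entries and checks that scaling leaves the count unchanged (the array values are illustrative):

```python
import numpy as np

def l0_norm(v):
    # the l0 "norm": the number of nonzero entries of v
    return np.count_nonzero(v)

x = np.array([0.0, 3.0, 0.0, -1.5, 2.0])

print(l0_norm(x))        # 3 nonzero entries
print(l0_norm(5.0 * x))  # still 3: ||lam*x||_0 = ||x||_0, so not absolutely homogeneous
```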
min_x ‖x‖₀ + f(x), (1)

where the loss function f(x) is the data fidelity term, which relates to problems in the signal and image processing fields [7]. Examples include the least-absolute (LA) loss function f(x) = ‖A(x) − b‖₁ and the least squares (LS) loss function f(x) = ‖A(x) − b‖₂². Due to the nonconvexity and discontinuity of the ℓ0-norm, the resulting optimization problem is known to be intractable (see, e.g., [8] for the NP-hardness of ℓ0-minimization over a linear system). A common approach is ℓ1-norm convex regularization, which replaces ‖x‖₀ with ‖x‖₁ = Σᵢ |xᵢ|, a convex relaxation of the ℓ0-norm [9], especially in compressed sensing. The ℓ1-relaxation is enormously popular because of its exact recovery property under some conditions [10]. The ℓ1-norm model has been widely used in many areas of application, for instance radar systems [11], communications [12], computed tomography (CT) [13] and magnetic resonance imaging (MRI) [14].
The ℓ1-regularizer can be a loose relaxation of the ℓ0-norm and does not always recover the true relevant variables [15]. In the ℓ0-norm, all nonzero entries contribute equally. The ℓ1-regularizer, however, penalizes amplitudes uniformly and therefore tends to underestimate the high-amplitude components of x. Because of this, reconstruction failures might occur with the fewest measurements [15,16], leading to undesirable blocky images [17]. In terms of sparsity, the ℓ1-norm and ℓ0-norm models do not perform similarly. Hence, much research in the fields of compressive sensing and low-rank matrix recovery suggests that better performance can be achieved with better approximations of the ℓ0-norm and the matrix rank [7]. In many applications, nonconvex ℓ0-norm based regularization has advantages over the convex ℓ1-norm, such as in image restoration [18], bioluminescence [19], CT [17] and MRI reconstruction [20].
In this paper, we propose to solve large scale sparse optimization problems from an optimization angle: the ℓ0-norm and its constraint are minimized directly with the proposed method. The paper is organised as follows. Section 2 reviews some standard optimization methods for solving unconstrained smooth optimization problems, as well as the proximal method for solving non-smooth problems. In Section 3, the spectral proximal method, which combines the spectral gradient method and the proximal method, is proposed to solve the NP-hard sparse optimization problem. In Section 4, we present and discuss the results of comparisons between the spectral proximal method and existing methods; an executable program has been developed to test the efficiency of the method. A short conclusion is given in Section 5.

Standard optimization methods
The unconstrained optimization problem is as follows:

min_x f(x), (2)

where f is a twice continuously differentiable function whose gradient is denoted by g(x) = ∇f(x).
In this paper, we are interested in solving large-scale optimization problems in which the Hessian of f is either not available or requires a large amount of storage. Gradient methods are usually used to solve the large-scale unconstrained convex optimization problem (2); they are iterative minimization techniques that find the minimizer by the rule

x_{k+1} = x_k + α_k d_k, (3)

where d_k is the search direction and α_k > 0 is the steplength. The classical steepest descent method, also known as the gradient descent method, was proposed by Cauchy (1847) [21]. It is the simplest gradient method for solving large scale unconstrained optimization and moves in a zigzag-like path along the negative gradient direction towards the local minimum point [22].
The steepest descent algorithm is easy to implement and requires only low storage, O(n) [22]. However, the method is relatively slow when close to the minimum: near the local minimum the gradient is nearly zero, so the rate of descent is slow [23]. The method is important for the development of optimization theory, but it is too slow for most real-world problems. Hence, more powerful techniques such as the conjugate gradient method and quasi-Newton methods are frequently used [24].
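The negative-gradient iteration x_{k+1} = x_k − α g(x_k) can be sketched on a small quadratic; the matrix, right-hand side and fixed steplength below are illustrative choices, not the paper's test problems:

```python
import numpy as np

# Minimize f(x) = 0.5 x^T Q x - b^T x by steepest descent (illustrative data).
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])

def grad(x):
    return Q @ x - b

x = np.zeros(2)
alpha = 0.1                   # fixed steplength, small enough for this Q
for _ in range(500):
    g = grad(x)
    if np.linalg.norm(g) < 1e-8:
        break
    x = x - alpha * g         # move along the negative gradient

x_star = np.linalg.solve(Q, b)   # exact minimizer for comparison
```

Only the current iterate and gradient are stored, which is the O(n) memory footprint mentioned above.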
The conjugate gradient method is an iterative method that uses conjugate directions with Q-orthogonality. Suppose f is a quadratic function f(x) = ½xᵀQx + bᵀx + c in n variables, where Q is a symmetric positive definite matrix. Q-orthogonality states that the directions d₀, d₁, …, d_{n−1} are a set of n mutually conjugate vectors with respect to Q if dᵢᵀQdⱼ = 0 for all i ≠ j. The minimum of the function can be found by using these dᵢ as search directions in n or fewer iterations [25]. The method surpasses steepest descent because it avoids repetitive steps and takes only Q-orthogonal steps. In general, the conjugate gradient algorithm converges faster than the basic steepest descent algorithm [26].
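The finite-termination property can be seen in a sketch of linear conjugate gradient on a 3-variable quadratic: n = 3 Q-conjugate steps suffice in exact arithmetic (the data is illustrative):

```python
import numpy as np

# Linear CG for f(x) = 0.5 x^T Q x - b^T x with SPD Q (illustrative 3x3 data).
Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x = np.zeros(3)
r = b - Q @ x              # residual (negative gradient)
d = r.copy()               # first search direction
for _ in range(3):         # at most n conjugate steps for an n-variable quadratic
    Qd = Q @ d
    alpha = (r @ r) / (d @ Qd)      # exact minimizer along d
    x = x + alpha * d
    r_new = r - alpha * Qd
    if np.linalg.norm(r_new) < 1e-12:
        break
    beta = (r_new @ r_new) / (r @ r)  # keeps the new direction Q-orthogonal
    d = r_new + beta * d
    r = r_new
```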
Quasi-Newton methods are algorithms used to find local maxima or minima of a function [27]. They are improved versions of Newton's method that overcome its main drawback, the difficulty of evaluating the inverse of the Hessian matrix. Since the inverse of the Hessian is a full rank matrix, the computational time and cost are high. To avoid the complexity of computing the inverse Hessian, quasi-Newton methods use an approximation in place of the exact inverse Hessian. Two popular quasi-Newton methods are Broyden-Fletcher-Goldfarb-Shanno (BFGS) and Davidon-Fletcher-Powell (DFP). In general, quasi-Newton methods require more storage and computational time than the conjugate gradient method [28].
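One step of the standard BFGS inverse-Hessian update illustrates how the approximation is built from a secant pair (s, y) without ever inverting a matrix; the 2×2 data is illustrative:

```python
import numpy as np

def bfgs_update(H, s, y):
    # Standard rank-two BFGS update of the inverse-Hessian approximation H.
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

Q = np.array([[2.0, 0.0],
              [0.0, 5.0]])     # true Hessian of an illustrative quadratic
s = np.array([1.0, 1.0])       # step x_{k+1} - x_k
y = Q @ s                      # gradient change g_{k+1} - g_k
H = bfgs_update(np.eye(2), s, y)
# H now satisfies the secant equation H y = s
```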
To overcome the disadvantages of these methods, the spectral gradient method was proposed by Sim et al. [29] to minimize the storage usage and computation time.

Spectral gradient method
DFP and BFGS were derived based on the Frobenius norm, where s_k = x_{k+1} − x_k and y_k = g_{k+1} − g_k. Notice that the constraint in (4) is the secant equation H_{k+1}y_k = s_k, which is the reverse of B_{k+1}s_k = y_k and incorporates the curvature condition [29].
The matrix H_{k+1}, the inverse of B_{k+1}, is the approximation of the inverse Hessian matrix.
Although the DFP and BFGS algorithms avoid computing the inverse of the Hessian matrix directly, they still require O(n²) storage for the inverse Hessian approximation. To overcome this, the spectral gradient method was proposed by Sim et al. in 2019 [29]. This method requires only O(n) storage, replacing the full rank matrix B_{k+1} with a diagonal matrix D_{k+1}, where D_{k+1} approximates the eigenvalues of the actual Hessian matrix. To derive an updating scheme for D_{k+1}, its components are restricted under some measure by minimizing the log-determinant norm while satisfying the secant equation in (4). Hence, for any positive definite matrix, the update D_{k+1} is given by the solution of

min tr(D_{k+1}) − ln(det(D_{k+1})) (6)
s.t. D_{k+1}s_k = y_k.
Substituting (11) into the constraint (9) gives (12), where the Lagrange multiplier can be obtained by applying only one iteration of Newton-Raphson, starting from λ = 0. When sᵀy > 0, equation (12) has a unique positive solution, and hence the Lagrange multiplier λ can be approximated in closed form.
The algorithm for the spectral gradient method is as follows:
Step 1. Set k = 0; give an initial point x₀ and a convergence tolerance ε.
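The diagonal updating formula of Sim et al. [29] is not reproduced here; as a related illustration of the spectral idea, the classical Barzilai-Borwein steplength α_k = (sᵀs)/(sᵀy) approximates the inverse Hessian by a scalar multiple of the identity using only O(n) storage. The quadratic below is an illustrative example, not the authors' test set:

```python
import numpy as np

# Spectral (Barzilai-Borwein) gradient method on an illustrative quadratic.
Q = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda x: Q @ x - b

x_prev = np.zeros(2)
g_prev = grad(x_prev)
x = x_prev - 0.01 * g_prev          # one small initial gradient step
for _ in range(100):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    s, y = x - x_prev, g - g_prev   # secant pair
    alpha = (s @ s) / (s @ y)       # BB1 spectral steplength, O(n) work
    x_prev, g_prev = x, g
    x = x - alpha * g
```

Only two vectors per iteration are kept, in contrast with the O(n²) inverse Hessian approximation of BFGS/DFP.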

Proximal method
The proximal method is not an optimization method in itself; however, we discuss it in this section because we borrow it for solving our problem. Here we review some of its basic ideas. The proximal method was introduced by Martinet [32] as a regularization method in the context of convex optimization in Hilbert spaces. Steepest descent, conjugate gradient and Newton's methods are standard tools for solving unconstrained smooth minimization problems of modest size, while proximal algorithms can be viewed as analogous tools for solving non-smooth, constrained, large-scale or distributed problems. They are especially well suited to problems with large or high-dimensional datasets. The base operations in classical methods are low-level, consisting of linear algebra operations and the computation of gradients and Hessians, while the base operation in proximal algorithms is evaluating the proximal operator of a function [33].
The proximal point mapping is the basis of many optimization techniques for convex functions. Recently, the proximal algorithm has been extended from convex optimization to nonconvex optimization [34]. A nonconvex and non-smooth function is a hard problem in optimization and is usually handled via a tractable convex optimization based on regularization or relaxation. Here, the proximal method will be used to solve the nonconvex problem directly.
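As a concrete instance relevant to this paper, the proximal operator of g(x) = λ‖x‖₀ has a well-known closed form: componentwise hard thresholding, which keeps an entry v_i when ½v_i² > λ and zeroes it otherwise. A minimal sketch (the vector and λ are illustrative):

```python
import numpy as np

def prox_l0(v, lam):
    # Proximal operator of lam * ||.||_0: argmin_u lam*||u||_0 + 0.5*||u - v||^2.
    # Componentwise: keep v_i if 0.5*v_i^2 > lam, otherwise set it to zero.
    out = v.copy()
    out[v ** 2 <= 2.0 * lam] = 0.0
    return out

v = np.array([0.1, -2.0, 0.5, 3.0])
print(prox_l0(v, 0.5))   # small entries 0.1 and 0.5 are zeroed out
```

The operator is separable across coordinates, so one evaluation costs only O(n) work, matching the large-scale setting of this paper.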

Spectral proximal method
Sparse optimization refers to an optimization problem involving the ℓ0-norm in the objective or constraints. In this paper, we propose an efficient, general purpose algorithmic approach and an efficient implementation for the following nonconvex optimization problem:

min_x ‖x‖₀ s.t. Ax = b, (16)

where Ax = b is an underdetermined system. Since the ℓ0-norm is nonconvex, nondifferentiable and discontinuous, the problem cannot be solved by the standard tools applicable to unconstrained smooth optimization. Moreover, the constraint Ax = b is difficult to handle and requires large computation time for high dimensional datasets.
Problem (16) arises frequently in machine learning, neural networks, signal processing and bioinformatics, which motivated us to develop an efficient algorithm to solve it.
The research methodology consists of two main parts. Firstly, the proximal method and spectral gradient method programming frameworks are reviewed in order to examine how these methods can be extended to solve optimization problems involving the ℓ0-norm and its constraint. The constrained optimization problem is transformed into an unconstrained one by the Lagrange method: instead of solving Ax = b exactly, we minimize the residue of Ax = b, which can be expressed as

min_x ½‖Ax − b‖₂².

Secondly, the spectral gradient method is modified and incorporated into the proximal method for solving the sparse optimization problem. Since the ℓ0-norm is nonconvex, nondifferentiable and discontinuous, it is handled by the proximal method, which is very generally applicable and especially suitable for problems involving large or high-dimensional datasets.
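The smooth residue r(x) = ½‖Ax − b‖₂² has gradient Aᵀ(Ax − b), which is what the spectral gradient part of the method works with. A quick finite-difference check on random illustrative data confirms this formula:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((7, 10))   # illustrative underdetermined system
b = rng.standard_normal(7)
x = rng.standard_normal(10)

r = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
g = A.T @ (A @ x - b)              # analytic gradient of the residue

eps = 1e-6
g_fd = np.array([(r(x + eps * e) - r(x - eps * e)) / (2 * eps)
                 for e in np.eye(10)])   # central differences
```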
A proximal quasi-Newton method was proposed previously by Tang et al. [35]. In the literature, no research has been done on the spectral proximal method or the proximal steepest descent method. To fill this gap, we propose to solve the residue by applying the spectral gradient method of Sim et al. [29]. In that paper, the spectral gradient method outperformed a list of conjugate gradient methods, since it requires less storage and less computational time.
The spectral gradient method and the proximal method are applied alternately to solve the problem. The steplength α_k is obtained through the Armijo condition as the line search strategy [31].
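The Armijo condition accepts a steplength α once f(x + αd) ≤ f(x) + c₁α gᵀd, shrinking α by backtracking until it holds. A minimal sketch (the function, constants and starting point are illustrative):

```python
import numpy as np

def armijo(f, x, g, d, alpha0=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    # Backtracking line search: halve alpha until the Armijo
    # sufficient-decrease condition f(x + a*d) <= f(x) + c1*a*g^T d holds.
    alpha, fx = alpha0, f(x)
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c1 * alpha * (g @ d):
            return alpha
        alpha *= shrink
    return alpha

f = lambda x: 0.5 * x @ x      # illustrative smooth objective
x = np.array([4.0, -2.0])
g = x                          # gradient of f at x
d = -g                         # descent direction
alpha = armijo(f, x, g, d)
```

Each trial steplength costs one extra function evaluation, which is exactly what the function-call benchmark in Section 4 counts.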
The algorithm is as follows:
Step 1. Initialize parameters. Set k = 0; give an initial guess x₀, set a convergence tolerance ε, and set the sparsity control parameter λ > 0. Compute the initial gradient.
Step 2. Check the stopping criteria.
In short, the non-smooth ℓ0-norm is handled by the proximal method, and the smooth residue is solved by the spectral gradient method. These two methods are combined, and the resulting algorithm is named the spectral proximal method. Next, an executable code is developed in Python to compare the efficiency of the proposed method with the existing methods. The benchmarks of the comparison are the number of iterations, the number of function calls and the computational time.
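The alternating scheme can be sketched as follows: a spectral gradient step on the smooth residue ½‖Ax − b‖₂², followed by the ℓ0 proximal step (hard thresholding). This is an illustrative sketch only: the Barzilai-Borwein steplength stands in for the diagonal update of Sim et al., and the problem data, λ and safeguards are assumptions rather than the authors' settings:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 10))      # illustrative underdetermined system
x_true = np.zeros(10)
x_true[[1, 4]] = [2.0, -3.0]          # a sparse ground truth
b = A @ x_true

lam = 1e-3                            # sparsity control parameter (assumed)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
obj = lambda x: f(x) + lam * np.count_nonzero(x)

x = np.zeros(10)
g = grad(x)
alpha = 1e-2                          # initial steplength (assumed)
best_x, best_obj = x.copy(), obj(x)
for k in range(500):
    x_new = x - alpha * g                               # smooth (gradient) step
    x_new[x_new ** 2 <= 2.0 * lam * alpha] = 0.0        # proximal hard-threshold step
    g_new = grad(x_new)
    s, y = x_new - x, g_new - g
    if s @ y > 1e-12:
        # BB spectral steplength, safeguarded to a bounded interval
        alpha = min(max((s @ s) / (s @ y), 1e-6), 1e2)
    x, g = x_new, g_new
    if obj(x) < best_obj:
        best_x, best_obj = x.copy(), obj(x)
```

The per-iteration cost is dominated by two matrix-vector products and the O(n) thresholding, with no matrix storage beyond A itself.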

Numerical experiments and discussion
In order to compare the efficiency of the spectral proximal method with other methods, we developed an executable code using Python 3.7. The code runs on a Lenovo IdeaPad 330S, and all experiments were executed with an 8th Generation Intel Core i7 processor and 12 GB RAM. For all experiments, we choose the initial point x₀ as a column vector (n × 1) of ones; A is an m × n matrix and b is an m × 1 vector. For the proximal method, λ is set to 10⁻⁸. The maximum number of iterations is set at 5000, and a method is considered to have failed when the number of iterations exceeds 5000. The tolerance ε is set to 10⁻⁴, and the algorithm is terminated when the gradient norm is at most ε. We generated 500 random matrices of dimensions 7 × 10, 35 × 50, 70 × 100, 140 × 200 and 350 × 500, with the condition number set from 1 to 20 and 5 different seeds for each dimension. The results from the 5 seeds with the same dimension and condition number are averaged. The efficiency of the methods is measured by the number of iterations, the number of function calls and the computational time. The three methods compared are steepest descent with proximal (PSD), Broyden-Fletcher-Goldfarb-Shanno with proximal (PBFGS) and the spectral proximal method (PSG). The profiling graphs are shown below.

Figure 1 shows that PSD requires a larger number of iterations than the others. This is because it moves in a zigzag form towards the local minimum point: it takes more steps and converges more slowly, especially close to the minimum point. PBFGS performs slightly better than PSD. Overall, PSG shows the best performance with the fewest iterations.

Figure 2 shows the profiling graph in terms of the number of function calls. From the graph, we can conclude that the PSG method requires the fewest function calls to reach the desired solution, while the PBFGS method performs slightly better than the PSD method. This is mainly because the PBFGS and PSD methods need more function calls in the backtracking line search to obtain the optimum step size, while PSG requires fewer function calls to reach similar results.

Figure 3 shows the profiling graph in terms of computational time. The PSD method requires more computational time because its zigzag-like path gives slower convergence. PSG requires less computational time than PBFGS because in PBFGS the Hessian matrix is approximated by a full rank matrix with O(n²) storage, while in PSG the eigenvalues of the Hessian matrix are approximated by a diagonal matrix with O(n) storage. Hence, PSG requires minimal storage and thus less computational time.

Conclusions
A new method called the spectral proximal method is proposed for solving large scale sparse optimization problems. Sparse optimization involves minimizing the ℓ0-norm (an NP-hard problem), which is nonconvex and non-smooth. To solve this problem, we incorporated the spectral gradient method into the proximal method. The numerical results show that the spectral proximal method requires less storage and less CPU time; hence the proposed method performs better than the other methods.