Non-negative Matrix Factorization for Dimensionality Reduction

Abstract—Matrix factorization methods reduce the dimensionality of data while preserving its important information. In this work, we present the Non-negative Matrix Factorization (NMF) method, focusing on its advantages over other matrix factorization methods. We discuss the main optimization algorithms used to solve the NMF problem and their convergence. The paper also contains a comparative study of principal component analysis (PCA), independent component analysis (ICA), and NMF for dimensionality reduction on a face image database.


I. INTRODUCTION
Dimensionality reduction is essential for extracting information from high-dimensional data, and PCA and ICA are the best-known matrix factorization methods used for this task. However, for many data sets, such as images and text, the original data matrices are non-negative. Factorizations such as PCA and ICA contain negative values and are difficult to interpret for some applications. In contrast, non-negative matrix factorization restricts the elements of the factor matrices to be non-negative.
The idea of NMF was first introduced by Paatero and Tapper in 1994 and popularized by Lee and Seung in 1999 [12]. Since then, NMF has gradually become an attractive multidimensional data processing tool for many researchers, owing to the natural, meaningful interpretation of its results induced by the non-negativity constraint.
NMF seeks two non-negative low-rank matrices W ∈ R^{m×r} and H ∈ R^{r×n} whose product approximates the non-negative data matrix X ∈ R^{m×n}:

X ≈ WH,   W ⩾ 0, H ⩾ 0,    (1)

where r is the factorization rank (r ≪ rank(X) ⩽ min(m, n)), which determines how many features are extracted from the data.
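As a quick illustration of these shapes, the following minimal sketch (ours, using scikit-learn's NMF implementation rather than any algorithm specific to this paper) factorizes a small random non-negative matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data matrix X with m = 6 samples and n = 4 features;
# the values are arbitrary and only illustrate the shapes involved.
rng = np.random.default_rng(0)
X = rng.random((6, 4))

r = 2  # factorization rank, r << min(m, n)
model = NMF(n_components=r, init="nndsvd", max_iter=500)
W = model.fit_transform(X)   # W in R^{6 x 2}: per-sample weights
H = model.components_        # H in R^{2 x 4}: non-negative basis vectors

print(np.linalg.norm(X - W @ H, "fro"))  # reconstruction error ||X - WH||_F
```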
More precisely, each data point, represented as a row of X, can be approximated by an additive combination of non-negative basis vectors, the rows of H, with weights given by the corresponding row of W (see Fig. 1). Unlike other matrix factorization methods, NMF leads to a parts-based representation (i.e., it treats objects as a collection of constituent parts), because it allows only additive combinations of the basis elements. For example, in face recognition it is interesting to note that the resulting basis vectors are clear parts of human faces (see Fig. 1), e.g., nose, ears, and eyes, and these elements add to each other to recreate the face (the original data) [14]. On the other hand, classical factorization methods such as PCA and ICA produce both positive and negative values, and therefore give components that offer little interpretability.
Moreover, the remarkable effectiveness of NMF in analyzing non-negative data has attracted a significant amount of research in many other areas, such as image processing [6], [19], [4], text mining [3], [21], and source separation [15], [7]. Currently, there is ongoing research on NMF to increase its efficiency and robustness.
To this end, the remainder of this paper is organized as follows. Section II formulates the NMF problem and discusses the major challenges in solving it, as well as the input parameters, namely the initialization of the matrices W and H and the factorization rank r; it also presents several optimization algorithms. The last sections apply the three methods, PCA, ICA, and NMF, to face image data.

II. NON-NEGATIVE MATRIX FACTORIZATION

A. Problem formulation
The non-negative matrix factorization can be mathematically formulated as the constrained optimization problem

(P):  min_{W,H ⩾ 0} D(X | WH),

where W, H ⩾ 0 means that every element of W and H is non-negative, and D(x|y) is a loss function, most commonly chosen to be the Euclidean distance

D_Euc(x|y) = ½ ‖x − y‖²_F.    (2)

The choice of the NMF cost function is made according to the type of data to be analyzed; in this article, the Euclidean distance is selected as the objective function. D is a non-convex function in the two variables W and H, so it is difficult to find the global minimum of (P). Another weakness of the NMF problem is that it is ill-posed, i.e., W and H are non-unique [17]. For example, given a non-singular matrix A such that Ŵ = WA ⩾ 0 and Ĥ = A⁻¹H ⩾ 0, we have ŴĤ = WH, so (Ŵ, Ĥ) is another solution pair. To overcome this problem, researchers usually add prior knowledge on W and H, such as sparseness or orthogonality constraints.
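The ill-posedness is easy to demonstrate numerically. In the following small sketch (ours; euclidean_loss is an illustrative helper, not from the paper), a positive diagonal rescaling produces a second factor pair with exactly the same product:

```python
import numpy as np

def euclidean_loss(X, W, H):
    """Euclidean NMF objective D_Euc(X | WH) = 0.5 * ||X - WH||_F^2."""
    R = X - W @ H
    return 0.5 * np.sum(R * R)

# Ill-posedness: rescaling by a positive diagonal matrix A preserves
# non-negativity and leaves the product WH (hence the loss) unchanged.
rng = np.random.default_rng(1)
W = rng.random((5, 2))
H = rng.random((2, 4))
A = np.diag([2.0, 0.5])                     # non-singular, A and A^{-1} >= 0
W_hat, H_hat = W @ A, np.linalg.inv(A) @ H
assert np.allclose(W @ H, W_hat @ H_hat)    # (W_hat, H_hat) is another solution
```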
As a consequence, the optimization problem (P) cannot be solved directly. In the literature, block coordinate descent (BCD) is the basic framework for all NMF algorithms. It is based on the idea that the cost function can be minimized in one block of variables at a time (for example, fixing W and varying H), in which case each sub-problem becomes convex:

H^{k+1} = argmin_{H ⩾ 0} D(X | W^k H),    (3)
W^{k+1} = argmin_{W ⩾ 0} D(X | W H^{k+1}).    (4)

To sum up, we alternately solve the sub-problems (3) and (4) in order to approach a solution of the whole problem; a code skeleton of this alternating scheme is given below.
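In code, the BCD framework is just an alternating loop. The following skeleton (ours, with placeholder solver arguments) is what the algorithms of the next subsection instantiate:

```python
def bcd_nmf(X, W, H, update_H, update_W, n_iter=200):
    """Generic BCD skeleton: alternately solve the two convex
    sub-problems (3) and (4). update_H and update_W are placeholders
    for any sub-problem solver (an MU step, a projected gradient
    step, an NNLS solver, ...)."""
    for _ in range(n_iter):
        H = update_H(X, W, H)  # sub-problem (3): W fixed, H varies
        W = update_W(X, W, H)  # sub-problem (4): H fixed, W varies
    return W, H
```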
So, all NMF algorithms solve (P) iteratively and, when they converge, it is to a local minimum.
Initialization of W and H is needed, and a good initialization leads to a better local solution, as illustrated in Fig. 2. Owing to the sensitivity of this step, researchers have proposed a variety of initialization methods [22], [23], [18], and recently Esposito [5] provided a taxonomy of the initialization schemes appearing in the literature. Furthermore, the factorization rank r is another important parameter that must be chosen. Too small a value of r may lose features, while too large a value may model noise, so the rank r should both reduce the noise in the data and effectively capture its key features. The choice of r is generally based on experiments or experience, although several rank selection techniques have recently been proposed [9], [20], [16], [24].
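As a toy illustration of this rank trade-off (our sketch with scikit-learn on random data, so it only shows the mechanics, not a real model-selection procedure), the reconstruction error decreases as r grows, but past some point the extra components mostly fit noise:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
X = rng.random((100, 50))

# Fit NMF for increasing ranks and report the final Frobenius error.
for r in (2, 5, 10, 20):
    err = NMF(n_components=r, init="nndsvd", max_iter=500).fit(X).reconstruction_err_
    print(f"r = {r:2d}   ||X - WH||_F = {err:.3f}")
```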
B. Algorithms

1) Projected Gradient Descent: Gradient descent (GD) is a popular optimization algorithm for solving the NMF problem. GD finds the minimum of a convex function by repeatedly moving, from the current point, in the direction opposite to the gradient of the function.
In our case, GD is used to solve each sub-problem. Considering the sub-problem (3), each GD iteration takes a step against the gradient of the Euclidean loss and projects the result back onto the non-negative orthant:

H ← [H − η ∇_H D(X | WH)]₊ = [H − η (WᵀWH − WᵀX)]₊,

where η > 0 is the learning rate (step size) and [·]₊ sets negative entries to zero.

2) Multiplicative Update: Multiplicative update rules (MUR) form the most common and most widely used NMF algorithm, owing to their simplicity. Multiplicative updates can be obtained in different ways, either by a heuristic approach or by a majorization-minimization (MM) approach; we present them successively below. Lee and Seung [12] were apparently the first to give the heuristic MU, based on the traditional GD algorithm (presented in the previous section) with an adaptive learning rate (step size). To avoid the subtraction in the GD update, [11] proposed to set the learning rates to

η_H = H ⊘ (WᵀWH),   η_W = W ⊘ (WHHᵀ),

which were derived to minimize the factorization error between X and WH (more detail on the MU derivation is given in [2]). The basic update rules are then

H ← H ⊙ (WᵀX) ⊘ (WᵀWH),   W ← W ⊙ (XHᵀ) ⊘ (WHHᵀ),

where ⊙ (resp. ⊘) denotes elementwise multiplication (resp. division). Both update styles are sketched in code below.
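To make the connection concrete, here is a minimal sketch (ours, in NumPy; the function names are illustrative) of one projected-gradient step and one multiplicative step for the sub-problem (3) under the Euclidean loss:

```python
import numpy as np

def pgd_step_H(X, W, H, step=1e-3):
    """One projected gradient step on sub-problem (3): descend along
    the gradient of 0.5*||X - WH||_F^2 in H, then project onto H >= 0."""
    grad = W.T @ W @ H - W.T @ X            # gradient of the Euclidean loss w.r.t. H
    return np.maximum(H - step * grad, 0.0)

def mu_step_H(X, W, H, eps=1e-12):
    """One multiplicative update of H: the same gradient step with the
    adaptive learning rate eta_H = H / (W^T W H). The subtraction cancels,
    so H stays non-negative automatically; eps avoids division by zero."""
    return H * (W.T @ X) / (W.T @ W @ H + eps)
```

In practice the PGD step size is chosen by line search rather than fixed, which is the flexibility referred to at the end of this subsection.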
The MM algorithm solves a generic instance of the sub-problem, written as

min_θ C(θ),    (5)

where θ stands for W or H and C for the corresponding cost, by two main steps:

- 1st step: construct a surrogate function G_k(θ) of C(θ) at the current iterate θ_k. A surrogate is a function that approximates another function; it must verify G_k(θ) ⩾ C(θ) and G_k(θ_k) = C(θ_k).
- 2nd step: minimize the surrogate to get the next iterate: θ_{k+1} = argmin_θ G_k(θ).

The MM procedure guarantees that the cost is non-increasing at each iteration (the monotonic descent property):

C(θ_{k+1}) ⩽ G_k(θ_{k+1}) ⩽ G_k(θ_k) = C(θ_k).

This property can also be observed in Fig. 4.
Building a surrogate function is a crucial step of the MM algorithm, and it is not easy. However, several inequalities from the literature help in finding this function, including Jensen's inequality, the convexity inequality, and the Cauchy-Schwarz inequality. In our case, the surrogate is constructed using the convexity of the function (x_ij − x)². This suggests alternating multiplicative updates which, in the equivalent matrix form, read

H ← H ⊙ (WᵀX) ⊘ (WᵀWH),   W ← W ⊙ (XHᵀ) ⊘ (WHHᵀ).

We notice that the MM update coincides with the heuristic update. The main difference between the MU and PGD algorithms is that the learning rate of PGD is flexible while that of MU is fixed; accordingly, MU is slower to converge than PGD. A minimal implementation is sketched below.
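For completeness, a full MU loop under the Euclidean loss might look as follows (a sketch of ours, not the paper's code; the monotonic descent can be checked by logging euclidean_loss from the earlier snippet at each iteration):

```python
import numpy as np

def mu_nmf(X, r, n_iter=500, eps=1e-12, seed=0):
    """NMF by multiplicative updates under the Euclidean loss.
    By the MM construction, the loss is non-increasing at every
    iteration; eps guards against division by zero."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```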
3) Alternating Least Squares: The alternating least squares (ALS) algorithm solves the problem (P) iteratively by dropping the non-negativity constraint, solving each least-squares sub-problem exactly, and then projecting the solution onto the non-negative space. The update rules are

H ← [(WᵀW)⁻¹WᵀX]₊,   W ← [XHᵀ(HHᵀ)⁻¹]₊.

In general, the ALS algorithm suffers from a lack of convergence guarantees. The alternating non-negative least squares (ANLS) variant instead finds, at each update step, the optimal solution of the bound-constrained sub-problem; many methods can be used to solve it, including PGD [13], active set [10], projected quasi-Newton [1], and projected Barzilai-Borwein [8]. Compared with the MU algorithm, ANLS converges relatively faster. A sketch of the plain ALS scheme follows.
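The following minimal ALS sketch is ours (using least-squares solves via np.linalg.lstsq, an implementation choice, followed by zeroing of negative entries):

```python
import numpy as np

def als_nmf(X, r, n_iter=200, seed=0):
    """Alternating least squares: solve each unconstrained least-squares
    sub-problem exactly, then project the solution onto the non-negative
    orthant by zeroing negative entries (no convergence guarantee)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))
    for _ in range(n_iter):
        # Solve W H = X for H, then project onto H >= 0.
        H = np.maximum(np.linalg.lstsq(W, X, rcond=None)[0], 0.0)
        # Solve H^T W^T = X^T for W, then project onto W >= 0.
        W = np.maximum(np.linalg.lstsq(H.T, X.T, rcond=None)[0].T, 0.0)
    return W, H
```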

III. METHOD
Comparisons between the three matrix factorization methods (i.e., PCA, ICA, and NMF) were made using a face image database. Before applying each method, the images were converted to grayscale and the data were centered (Fig. 5). The low rank (number of components) was fixed to r = 3 for all three methods. Randomized SVD, FastICA, and MU are the algorithms used to solve PCA, ICA, and NMF, respectively, and NNDSVD (non-negative double singular value decomposition) is used as the initialization method for NMF. The source code is available online³.

IV. RESULT AND DISCUSSION

The first three components given by PCA, ICA, and NMF are shown in Fig. 6, Fig. 7, and Fig. 8, respectively. From these figures, we see that PCA was unable to reconstruct the faces because of its noisy components, and ICA lost some important facial features, so its images appear unclear. In contrast, the components extracted by NMF capture more features and details of the face, so the faces are better reconstructed than with PCA and ICA.
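For concreteness, the experimental pipeline of Section III can be sketched as follows (a minimal sketch of ours assuming scikit-learn; the Olivetti faces are used as a stand-in, since the paper's database is not named here):

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA, FastICA, NMF

# Stand-in face database: already grayscale, one flattened face per row.
faces = fetch_olivetti_faces(shuffle=True, random_state=0).data

r = 3  # low rank / number of components, as in the experiments

pca = PCA(n_components=r, svd_solver="randomized", random_state=0)
ica = FastICA(n_components=r, random_state=0)
# PCA and ICA center the data internally; NMF requires X >= 0, so it is
# fit on the raw non-negative pixel intensities. scikit-learn may suggest
# 'nndsvda' instead of 'nndsvd' when combined with the MU solver.
nmf = NMF(n_components=r, init="nndsvd", solver="mu", max_iter=500)

components = {
    "PCA": pca.fit(faces).components_,
    "ICA": ica.fit(faces).components_,
    "NMF": nmf.fit(faces).components_,   # shape (3, n_pixels)
}
```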
Moreover, PCA needed 2.018 s of training time to produce its extracted components and ICA 4.032 s, whereas NMF took longer, at 10.609 s.

V. CONCLUSION
In this work, we described the NMF problem and its well-known algorithms, such as MU, PGD, and ANLS. We also compared NMF with the PCA and ICA methods using a face image database. PCA and ICA showed a significant loss of face information, whereas NMF was able to extract face features and retain more information after reducing the dimensionality.