Theoretical developments for interpreting kernel spectral clustering from alternative viewpoints

Diego Peluffo-Ordóñez*1, Paul Rosero-Montalvo1,2, Ana Umaquinga-Criollo1, Luis Suárez-Zambrano1, Hernán Domínguez-Limaico1, Omar Oña-Rocha1, Stefany Flores-Armas1, Edgar Maya-Olalla1
1Universidad Técnica del Norte, Facultad de Ingeniería en Ciencias Aplicadas, 100150, Ecuador
2Instituto Tecnológico Superior 17 de Julio, Yachay, Ecuador


Introduction
In general, for classifying or grouping a set of objects (represented as data points) into subsets holding similar objects, the field of machine learning -specifically, pattern recognition- provides two main alternatives that are essentially different from each other: supervised- and unsupervised-learning-based approaches. The former normally establish a model from previously known information on the data, typically provided by an expert, while the latter form the groups by following a natural clustering criterion based on a (traditionally heuristic) procedure of data exploration [1]. Therefore, unsupervised clustering techniques are preferred when object labelling is either unavailable or unfeasible. In the literature, we can find tens of clustering techniques based on different principles and criteria (such as distances, densities, data topology, and divergences, among others) [2]. Some remarkable, emerging applications are imbalanced data analysis [3] and time-varying data analysis [4]. Particularly, spectral clustering (SC) is a suitable technique to deal with grouping problems involving hardly separable clusters. Many SC approaches have been proposed, among them: normalized-cut-based clustering (NCC), which, as explained in [5], heuristically and iteratively estimates binary cluster indicators [5] or approximates the solution in a one-iteration fashion by solving a quadratic programming problem [6]; kernel k-means (KKM), which can be formulated using eigenvectors [7]; and kernel spectral clustering (KSC), which uses a latent variable model and a least-squares support-vector-machine (LS-SVM) formulation [8]. This work has a particular focus on KSC, being one of the most modern approaches. It has been widely used in numerous applications such as time-varying data analysis [9,10], electricity load forecasting [11], and prediction of industrial machine maintenance [12], among others. Also, some improvements and extensions have been proposed [12-14].
The aim of this work is to demonstrate the relationship between KSC and other approaches, namely NCC and KKM. To do so, elegant mathematical developments are performed. Starting from either the primal or the dual formulation of KSC, we clearly show the links with the other considered methods. Experimentally, in order to assess the clustering performance, we explore the benefit of each considered method on image segmentation. In this connection, images extracted from the free-access Berkeley Segmentation Data Set [15] are used. As a meaningful result of this work, we also provide mathematical and experimental evidence of the usability of combining a LS-SVM formulation and a generic latent variable model for clustering purposes.
The rest of this paper is organized as follows: Section 2 outlines the primal-dual formulation for KSC, starting with a LS-SVM formulation regarding a latent variable model, which naturally yields an eigenvector-based solution. Sections 3 and 4 explore and discuss the links of KSC with NCC and KKM, respectively. Some experimental results are shown in Section 5. Section 6 presents some additional remarks on improved versions of KSC and its relationship with spectral dimensionality reduction. Finally, Section 7 draws some concluding remarks.
Peluffo-Ordóñez et al. / Advances in Science, Technology and Engineering Systems Journal Vol. 2, No. 3, 1670-1676 (2017)

Kernel spectral clustering
Accounting for notation and future statements, let us consider the following definitions: define a set of N objects or samples represented by d-dimensional feature vectors x_i ∈ R^d. Likewise, consider a data matrix X ∈ R^{N×d} holding all the feature vectors, so that its i-th row corresponds to the i-th data point. KSC aims to split X into K disjoint subsets, being K the number of desired groups.

Latent variable model and problem formulation
In the following, the clustering model is described. Let e^{(l)} ∈ R^N be the l-th projection vector, which is assumed in the following latent variable form:

e^{(l)} = Φ w^{(l)} + b_l 1_N, l ∈ {1, ..., n_e},   (1)

where w^{(l)} ∈ R^{d_h} is the l-th weighting vector, b_l is a bias term, n_e is the number of considered latent variables, notation 1_N stands for an N-dimensional all-ones vector, and the matrix Φ = [φ(x_1)^⊤, ..., φ(x_N)^⊤]^⊤, Φ ∈ R^{N×d_h}, is a high-dimensional representation of the data. The function φ(·) maps data from the original dimension to a higher one d_h, i.e., φ(·): R^d → R^{d_h}. Therefore, e^{(l)} represents the latent variables of a set of n_e binary cluster indicators obtained with sign(e^{(l)}), which are to be further encoded to obtain the K resultant groups.
From the least-squares SVM formulation applied to the model of equation (1), the following optimization problem can be stated:

min_{W,E,b} (1/2) Σ_{l=1}^{n_e} w^{(l)⊤} w^{(l)} − (1/(2N)) Σ_{l=1}^{n_e} γ_l e^{(l)⊤} V e^{(l)}   (2)
s.t. e^{(l)} = Φ w^{(l)} + b_l 1_N,

where γ_l ∈ R^+ is the l-th regularization parameter and V ∈ R^{N×N} is a diagonal matrix representing the weight of the projections.

Matrix problem formulation
For the sake of simplicity, we can express the primal formulation (2) in matrix terms as follows:

min_{W,E,b} (1/2) tr(W^⊤ W) − (1/(2N)) tr(Γ E^⊤ V E)   (3)
s.t. E = Φ W + 1_N ⊗ b^⊤,

where b = [b_1, ..., b_{n_e}]^⊤, b ∈ R^{n_e}, Γ = Diag([γ_1, ..., γ_{n_e}]), W = [w^{(1)}, ..., w^{(n_e)}], W ∈ R^{d_h×n_e}, and E = [e^{(1)}, ..., e^{(n_e)}], E ∈ R^{N×n_e}. Notations tr(·) and ⊗ denote the trace and the Kronecker product, respectively. By minimizing the previous cost function, the goals of maximizing the weighted variance of E and minimizing the variance of W are reached simultaneously. Let Σ_E be the weighted covariance matrix of E and Σ_W be the covariance matrix of W. Since matrix V is diagonal, we have that tr((V^{1/2}E)^⊤ V^{1/2}E) = tr(Σ_E); in other words, Σ_E is the covariance matrix of the weighted projections, i.e., the projections scaled by the square root of matrix V. As well, tr(W^⊤ W) = tr(Σ_W). Then, KSC can be seen as a kernel weighted principal component analysis (KWPCA) approach [8].

Solving KSC by using a dual formulation
To solve the KSC problem, we form the corresponding Lagrangian of the problem of equation (2) as follows:

L(E, W, b, A) = (1/2) tr(W^⊤ W) − (1/(2N)) tr(Γ E^⊤ V E) + tr(A^⊤ (E − Φ W − 1_N ⊗ b^⊤)),   (4)

where matrix A ∈ R^{N×n_e} holds the Lagrange multiplier vectors, A = [α^{(1)}, ..., α^{(n_e)}], and α^{(l)} ∈ R^N is the l-th vector of Lagrange multipliers.
Solving the partial derivatives of L(E, W, b, A) to determine the Karush-Kuhn-Tucker (KKT) conditions, we obtain:

∂L/∂W = 0 ⟹ W = Φ^⊤ A, ∂L/∂E = 0 ⟹ A = (1/N) V E Γ,
∂L/∂b = 0 ⟹ A^⊤ 1_N = 0, ∂L/∂A = 0 ⟹ E = Φ W + 1_N ⊗ b^⊤.   (5)

Therefore, by eliminating the primal variables from the initial problem (2) and assuming a kernel trick such that Φ Φ^⊤ = Ω, being Ω ∈ R^{N×N} a given kernel matrix, the following eigenvector-based dual solution is obtained:

V Ω A = A Λ,   (6)

where Λ = Diag(λ) and λ is the vector of eigenvalues with λ_l = N/γ_l, λ_l ∈ R^+. Also, taking into account that the kernel matrix represents the similarity matrix of a graph with K connected components, as well as V = D^{−1}, where D ∈ R^{N×N} is the degree matrix defined as D = Diag(Ω 1_N), the K − 1 eigenvectors contained in A associated with the largest eigenvalues are piecewise constant and become indicators of the corresponding connected parts of the graph. Therefore, the value n_e is fixed to be K − 1 [8]. With the aim of achieving a dual formulation satisfying the centering condition on the projections, the bias term should be chosen in the form

b_l = −(1/(1_N^⊤ V 1_N)) 1_N^⊤ V Ω α^{(l)}.

Thus, the solution of the problem of equation (3) is reduced to the following eigenvector-related problem:

V H Ω A = A Λ,   (7)

where H ∈ R^{N×N} is the centering matrix defined as H = I_N − (1/(1_N^⊤ V 1_N)) 1_N 1_N^⊤ V, and the entries of the kernel matrix are given by Ω_{ij} = K(x_i, x_j) = φ(x_i)^⊤ φ(x_j), where K(·,·) stands for the kernel function. As a result, the set of projections can be calculated as:

E = Ω A + 1_N ⊗ b^⊤.   (8)

Once the projections are calculated, we proceed to carry out the cluster assignment by following an encoding procedure applied on the projections. Because each cluster is represented by a single point in the (K − 1)-dimensional eigenspace, such that those single points are always in different orthants due to the KKT conditions, we can encode the eigenvectors considering that two points are in the same cluster if they lie in the same orthant of the corresponding eigenspace [8].
Then, a code book can be obtained from the rows of the matrix containing the K − 1 binarized leading eigenvectors in its columns, by using sign(e^{(l)}). That is, matrix Ẽ = sgn(E) is the code book, each of its rows being a codeword.
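As a hedged illustration of this encoding step (the function name `ksc_encode` and the toy projections are ours, not the paper's), a few lines of NumPy suffice: the unique rows of sign(E) are the codewords, and each point's orthant gives its cluster.

```python
import numpy as np

def ksc_encode(E):
    """Binarize the projections and build the code book.

    Each row of sign(E) places a point in an orthant of the
    (K-1)-dimensional eigenspace; the unique rows are the codewords.
    """
    S = np.sign(E)
    codebook, labels = np.unique(S, axis=0, return_inverse=True)
    return codebook, labels

rng = np.random.default_rng(0)
E = rng.standard_normal((6, 2))      # toy projections with n_e = K - 1 = 2
codebook, labels = ksc_encode(E)
# labels[i] indexes the codeword (orthant) of point i, i.e., its cluster
```

Points sharing a codeword (same sign pattern across the K − 1 projections) are assigned to the same cluster.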

Out-of-sample extension
KSC can be extended to out-of-sample analysis, determining the cluster membership of new testing data without re-clustering the whole data set [8].
In particular, defining z ∈ R^{n_e} as the projection vector of a testing data point x_test, and taking into consideration the training clustering model, the testing projections can be computed as:

z = A^⊤ Ω_test + b,   (9)

where Ω_test ∈ R^N is the kernel vector such that (Ω_test)_i = K(x_test, x_i). Once the test projection vector z is computed, a decoding stage is carried out, which consists of comparing the binarized projections with the codewords in the code book Ẽ and assigning the cluster membership based on the minimal Hamming distance [8].
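A minimal sketch of this out-of-sample stage follows, assuming a Gaussian kernel; the variable names (A, b, codebook) stand for illustrative training quantities and are our choice, not the paper's notation for an implementation.

```python
import numpy as np

def ksc_out_of_sample(x_test, X, A, b, codebook, sigma=1.0):
    """Project a test point and decode it by minimal Hamming distance."""
    # Kernel vector between the test point and the N training points
    omega_test = np.exp(-np.sum((X - x_test) ** 2, axis=1) / (2 * sigma ** 2))
    z = A.T @ omega_test + b                  # test projections, shape (n_e,)
    code = np.sign(z)                         # binarized test projections
    hamming = np.sum(codebook != code, axis=1)
    return int(np.argmin(hamming))            # index of the closest codeword

# Toy usage with illustrative (random) training quantities
rng = np.random.default_rng(1)
X = rng.standard_normal((10, 3))
A = rng.standard_normal((10, 2))              # Lagrange multipliers, n_e = 2
b = np.zeros(2)
codebook = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
cluster = ksc_out_of_sample(X[0], X, A, b, codebook)
```

Note that no re-clustering is performed: only one kernel vector of length N and one matrix-vector product are needed per test point.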

KSC algorithm
The pseudo-code to perform KSC is shown in Algorithm 1.
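Algorithm 1 itself is not reproduced in this extract. As a sketch of what its training stage could look like, under the paper's choices V = D^{−1} and a Gaussian kernel (function and variable names are ours), the eigenproblem V H Ω A = A Λ can be solved numerically as:

```python
import numpy as np

def ksc_train(X, K, sigma=1.0):
    """Sketch of KSC training: K-1 leading eigenvectors of V H Omega."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    Omega = np.exp(-sq / (2 * sigma ** 2))       # Gaussian kernel matrix
    N = X.shape[0]
    v = 1.0 / Omega.sum(axis=1)                  # V = D^{-1} (diagonal of v)
    # Centering matrix H = I_N - (1/(1^T V 1)) 1_N 1_N^T V
    H = np.eye(N) - np.outer(np.ones(N), v) / v.sum()
    M = (v[:, None] * H) @ Omega                 # V H Omega (non-symmetric)
    w, U = np.linalg.eig(M)
    idx = np.argsort(-w.real)[: K - 1]           # K-1 leading eigenvectors
    A = U[:, idx].real                           # Lagrange multipliers
    b = -(v @ Omega @ A) / v.sum()               # bias terms, shape (K-1,)
    E = Omega @ A + np.outer(np.ones(N), b)      # projections
    return A, b, E
```

The rows of sign(E) then undergo the encoding procedure of the previous subsection to produce the final cluster assignments.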

Multi-cluster spectral clustering (MCSC) from two points of view
In [5], the so-called multi-cluster spectral clustering (MCSC) is introduced, which is based on the well-known k-way normalized-cut formulation given by:

max_M (1/K) Σ_{k=1}^{K} (m^{(k)⊤} Ω m^{(k)})/(m^{(k)⊤} D m^{(k)})   (10a)
s.t. M ∈ {0,1}^{N×K}, M 1_K = 1_N,   (10b)

where m^{(k)} is the k-th column of the binary cluster-indicator matrix M. Expressions (10a) and (10b) are the formulation of the NC optimization problem, named NCPM. The previous formulation can also be expressed as follows.
Let Ω̄ = D^{−1/2} Ω D^{−1/2} be a normalized kernel matrix and L = D^{1/2} M be the binary matrix normalized by the square root of the kernel degree. Then, a new NCPM version can be expressed as:

max_L (1/K) Σ_{k=1}^{K} (ℓ^{(k)⊤} Ω̄ ℓ^{(k)})/(ℓ^{(k)⊤} ℓ^{(k)}),   (11)

where ℓ^{(k)} is the k-th column of L.
The solution of the former problem has been addressed in [5,16] by introducing a relaxed version, in which the numerator is maximized subject to the denominator being constant, so:

max_L tr(L^⊤ Ω̄ L) s.t. tr(L^⊤ L) = const.   (12)

Indeed, the authors assume the particular case L^⊤ L = I_K, i.e., letting L be an orthonormal matrix. Then, the solution corresponds to any K-dimensional basis of eigenvectors of the normalized matrix. Although [16] presents a one-iteration solution for NCPM with suboptimal results that avoids the calculation of a SVD per iteration, omitting the effect of the denominator tr(L^⊤ L) by assuming orthogonality means that the solution cannot be guaranteed to be a global optimum. In addition, this kind of formulation provides non-stable solutions due to the heuristic search carried out to determine an optimal rotation matrix [16].
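The relaxed solution can be sketched numerically: under the orthonormality assumption, the maximizer is any basis of the K leading eigenvectors of the normalized kernel matrix (the function name is ours).

```python
import numpy as np

def ncc_relaxed(Omega, K):
    """Relaxed NCPM: maximize tr(L^T Omega_bar L) subject to L^T L = I_K.

    The maximizer is an orthonormal basis of the K leading eigenvectors
    of Omega_bar = D^{-1/2} Omega D^{-1/2}.
    """
    d = Omega.sum(axis=1)
    Omega_bar = Omega / np.sqrt(np.outer(d, d))   # D^{-1/2} Omega D^{-1/2}
    w, U = np.linalg.eigh(Omega_bar)              # symmetric, so eigh applies
    return U[:, np.argsort(-w)[:K]]               # K leading eigenvectors
```

A subsequent discretization step (such as the rotation-based search of [16]) is still needed to recover binary indicators from L, which is precisely the unstable heuristic stage discussed above.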

Solving the problem by a difference: Empirical feature map
Recalling the original problem of Section 3.1, we introduce another way to solve the NCPM formulation via a minimization problem in which the aims of maximizing tr(L^⊤ Ω L) and minimizing tr(L^⊤ L) can be accomplished simultaneously, so:

min_L (1/2) tr(L^⊤ L) − (1/(2N)) tr(Γ L^⊤ Ω L),   (13)

where Γ = Diag(γ) and γ is a vector containing the regularization parameters. Let us assume Ω = Ψ Ψ^⊤, where Ψ is an N × N auxiliary matrix; this factorization is possible since the kernel matrix Ω is symmetric. Now, let us define h^{(k)} = Ψ^⊤ ℓ^{(k)}, h^{(k)} ∈ R^N, as the k-th projection and H = (h^{(1)}, ..., h^{(K)}) as the projections matrix. Then, the formulation given by (13) can be expressed as follows:

min_L (1/2) tr(L^⊤ L) − (1/(2N)) tr(Γ H^⊤ V H),   (14)

where the matrix V ∈ R^{N×N} can be chosen as:
- I_N: We can normalize the matrix Ω in such a way that for all i the condition Σ_j ω_ij = 1 is satisfied, and therefore we would obtain a degree matrix equaling the identity matrix. Then, Σ_l h^{(l)⊤} h^{(l)} = tr(H^⊤ H), which corresponds to a PCA-based formulation.
- Diag(v): With v ∈ R^N such that v^⊤ v = 1, we have a WPCA approach.
- D^{−1}: Given the equality V = D^{−1}, the optimization problem can be solved by means of a procedure based on random walks, being the case of interest in this study.

Gaussian processes
In terms of Gaussian processes, the variable Ψ represents a mapping matrix such that Ψ = (ψ(x_1), ..., ψ(x_N)), where ψ(·): R^d → R^N is a mapping function providing a new N-dimensional data representation in which the resultant clusters are assumed to be more separable. Also, matrix Ω is to be chosen as a Gaussian kernel [17]. Therefore, according to the optimization problem given by (14), the term h^{(k)} = Ψ^⊤ ℓ^{(k)} is the k-th projection of the normalized binary indicators.

Eigen-solution
We now present a solution for problem (14): after solving the KKT conditions on its corresponding Lagrangian, an eigenvector problem is obtained. We first form the Lagrangian of problem (14), which per component reads:

L(ℓ^{(k)}, h^{(k)}, α^{(k)}) = (1/2) ℓ^{(k)⊤} ℓ^{(k)} − (γ_k/(2N)) h^{(k)⊤} V h^{(k)} + α^{(k)⊤} (h^{(k)} − Ψ^⊤ ℓ^{(k)}),

where α^{(k)} is an N-dimensional vector containing the Lagrange multipliers.
Solving the partial derivatives to determine the KKT conditions and eliminating the primal variables, we obtain the following eigenvector problem:

V Ω α^{(k)} = λ_k α^{(k)},

where λ_k = N/γ_k. Then, the matrix ∆_K = (α^{(1)}, ..., α^{(K)}) can be computed from the eigenvectors associated with the K largest eigenvalues of D^{−1} Ω.
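Numerically, ∆_K can be obtained directly from the random-walk matrix P = D^{−1}Ω; since D^{−1}Ω is similar to the symmetric matrix D^{−1/2}ΩD^{−1/2}, its eigenvalues are real. A small sketch with illustrative names:

```python
import numpy as np

def delta_K(Omega, K):
    """Eigenvectors of P = D^{-1} Omega for the K largest eigenvalues."""
    P = Omega / Omega.sum(axis=1)[:, None]    # rows of P sum to one
    w, U = np.linalg.eig(P)                   # real spectrum (similarity to
    idx = np.argsort(-w.real)[:K]             # a symmetric matrix)
    return U[:, idx].real, w.real[idx]
```

Since P is row-stochastic, its largest eigenvalue is 1 with a constant eigenvector, which corresponds to the trivial (uninformative) indicator; the remaining leading eigenvectors carry the cluster structure.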

Finally, the projections matrix H is obtained from the computed eigenvectors, and therefore M = Ψ^{−1} D^{−1/2} Ω ∆_K, where Ψ can be obtained from a Cholesky decomposition.
Then, within a finite domain, both the solution and the formulation of NCC can be expressed similarly as done in KSC, which demonstrates the relationship between a kernel-based model and Gaussian processes.
Kernel k-means

The kernel k-means method (KKM) is a generalization of standard k-means that can be seen as a spectral relaxation when introducing a mapping function into the objective function formulation [18]. As mentioned throughout this paper, spectral clustering approaches are usually performed on a lower-dimensional space, keeping the pairwise relationships among nodes. This often leads to relaxed NP-hard problems whose continuous solutions are obtained by an eigen-decomposition of the normalized similarity matrix (the Laplacian, as well). In a kernel k-means framework, the eigenvectors are considered as geometric coordinates, and the k-means method is then applied over the eigenspace to get the resultant clusters [19,20].
A concrete instance is as follows: suppose that we have an image of m × n pixels. Characterizing each image pixel with d features -e.g., color spaces, morphological descriptors- yields a data matrix of the form X ∈ R^{N×d}, where N = mn. Afterwards, the eigenvectors V ∈ R^{N×N} of a normalized kernel matrix P ∈ R^{N×N} are computed, such that P = D^{−1} Ω, being Ω the kernel matrix and D ∈ R^{N×N} the corresponding degree matrix. Then, we proceed to cluster V into K groups using the k-means algorithm, q = kmeans(V, K), being q ∈ R^N the output cluster indicator such that q_i ∈ [K]. The segmented image is then an m × n matrix holding regions in accordance with q.
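The recipe above can be sketched end-to-end as follows. The `kmeans` helper is a tiny Lloyd's-iteration stand-in for a library routine (e.g., something like `scipy.cluster.vq.kmeans2`), and all names are illustrative:

```python
import numpy as np

def kmeans(V, K, iters=20, seed=0):
    """Minimal Lloyd's algorithm over the rows of V."""
    rng = np.random.default_rng(seed)
    C = V[rng.choice(len(V), K, replace=False)]            # initial centers
    for _ in range(iters):
        q = np.argmin(((V[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(q == k):
                C[k] = V[q == k].mean(axis=0)              # update centers
    return q

def kkm_segment(X, m, n, K, sigma=1.0):
    """Cluster the leading eigenvectors of P = D^{-1} Omega with k-means."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    Omega = np.exp(-sq / (2 * sigma ** 2))                 # kernel matrix
    P = Omega / Omega.sum(axis=1)[:, None]                 # D^{-1} Omega
    w, U = np.linalg.eig(P)
    V = U[:, np.argsort(-w.real)[:K]].real                 # eigen-coordinates
    q = kmeans(V, K)                                       # cluster indicators
    return q.reshape(m, n)                                 # segmented image
```

For a real image, each row of X would hold the pixel's color-space features (and optionally its xy position), exactly as in the experimental setup of Section 5.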
Briefly put, one simple way to perform a KKM procedure is to apply k-means over the eigenspace. In equation (7), the dual formulation involves the matrix S = V H Ω, where the weighting matrix can be chosen as V = D^{−1} and D is the degree matrix of the data-related graph. Since H causes a centering effect, matrix S is the same as P when the kernel matrix Ω is centered. In other words, KKM can be seen as a KSC formulation with an incomplete latent variable model, namely a non-centered one (with no bias term).

Results and discussion
In order to show how the considered methods work, we conduct some experiments to test their clustering ability on segmenting images. To do so, the segmentation performance is quantified by a supervised index, the Probabilistic Rand index (PR), explained in [21], such that PR ∈ [0, 1], being 1 when regions are properly segmented. Images are drawn from the free-access Berkeley Segmentation Data Set [15]. To represent each image as a data matrix, we characterize the images by color spaces (RGB, YCbCr, Lab, LUV) and the xy position of each pixel. In the end, the data matrix X gathers N pixels represented by d characteristics (variables). To run the experiment, we resize the images to 20% of the original size due to memory usage restrictions. All the methods are performed with a given number of clusters K manually set as shown in Fig. 5, and using the scaled exponential similarity matrix as described in [19], setting the number of neighbors to be 9.
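We read the scaled exponential similarity as a local-scaling affinity in the spirit of self-tuning spectral clustering, with σ_i the distance from x_i to its k-th nearest neighbor; this is our interpretation of the construction in [19], sketched as:

```python
import numpy as np

def scaled_similarity(X, k=9):
    """Locally scaled exponential affinity: w_ij = exp(-d_ij^2/(s_i s_j))."""
    dist = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
    # s_i: distance from point i to its k-th nearest neighbor
    # (column 0 of the sorted distances is the point itself)
    sigma = np.sort(dist, axis=1)[:, k]
    Omega = np.exp(-dist ** 2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(Omega, 0.0)     # exclude self-similarity
    return Omega
```

The per-point scale makes the affinity robust to clusters of different densities, which matters for natural images where region sizes vary widely.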
To test all the methods in a fair scenario, the kernel-based methods (KSC and KKM) use Ω as the kernel matrix, whereas such a matrix is the affinity matrix for NCC. As well, to perform the clustering procedure, the number of clusters is the same for all the considered methods. As can be readily appreciated, KSC outperforms the rest of the studied clustering methods. This fact can be attributed to the KSC formulation, which involves a whole latent variable model incorporated within a LS-SVM framework. Indeed, just like principal component analysis (PCA), KSC optimizes an energy term; differently, such an energy term regards a latent variable instead of the input data matrix directly. Concretely, a latent variable model is used, which is linear and formulated in terms of projections of the input data. The versatility of KSC relies on the kernel matrix required during the optimization of its cost function. Such a matrix holds pairwise similarities; then, KSC can be seen as a data-driven approach that not only considers the nature of the data but also yields a true clustering model. It is important to note that -depending on the difficulty of the segmentation task- data matrices representing images yield feature spaces that may present hardly separable classes. Then, we have demonstrated the benefit of the KSC approach, which uses a model along with a LS-SVM formulation, everything within a primal-dual scheme. Other studies have also proven the usability and versatility of this kind of approach [8,22].

Additional remarks
As explained in [23], KSC performance can be enhanced in terms of cluster separability by optimally projecting the original input data and performing the clustering procedure over the projected space. Given its unsupervised nature, spectral clustering very often becomes a parametric approach, involving a stage of selection/tuning of a collection of initial parameters to avoid local-optimum solutions. Typically, the initial parameters are the kernel or similarity matrix and the number of groups. Nonetheless, in some problems where data are represented in a high-dimensional space and/or data sets are nonlinearly separable, a proper feature extraction may be an advisable alternative. In particular, a projection generated by a proper feature extraction procedure may provide a new feature space wherein the clustering procedure can reach more accurate cluster indicators. In other words, data projection accomplishes a new representation space, where the clustering can be improved in terms of a given mapping criterion, rather than performing the clustering procedure directly over the original input data.
The work developed in [23] introduces a matrix projection, devised for KSC, focusing on a better analysis of the structure of the data. Since data projection can be seen as a feature extraction process, we propose the M-inner-product-based data projection, in which the similarity matrix is also considered within the projection framework, similarly as discussed in [24]. There are two main reasons for using data projection to improve the performance of kernel spectral clustering: firstly, the data global structure is taken into account during the projection process and, secondly, the kernel method exploits the information of local structures.
Another study [25] explores the links of KSC with spectral dimensionality reduction from a kernel viewpoint. Particularly, the proposed formulation is a LS-SVM one, stated in terms of a generic latent variable model involving the projected input data matrix. In order to state a kernel-based formulation, such a projection maps data onto an unknown high-dimensional space. Again, the solution of the optimization problem is addressed through a primal-dual scheme. Finally, once the latent variables and parameters are determined, the resultant model outputs a versatile projected matrix able to represent data in a low-dimensional space. To do so, since the optimization is posed under a maximization criterion and the dual version has a quadratic form, the eigenvectors associated with the largest eigenvalues can be chosen as a solution. Therefore, the generalized kernel model may represent a weighted version of kernel principal component analysis.

Conclusions
This work explores a widely recommended method for unsupervised data classification, namely kernel spectral clustering (KSC). Through elegant developments, the relationship between KSC and two other well-known spectral clustering approaches (normalized cut clustering and kernel k-means) is demonstrated. As well, the benefit of KSC-like approaches is mathematically and experimentally proved. The goodness of KSC relies on the nature of its formulation, which is based on a latent variable model incorporated into a least-squares support-vector-machine framework. Additionally, some key aspects and hints to improve KSC performance, as well as its ability to represent dimensionality reduction approaches, are briefly outlined and discussed.
As future work, a generalized clustering framework is to be designed so that a wide range of spectral approaches can be represented. By doing so, the task of selecting and/or testing a spectral clustering method would become easier and fairer.