Chapter 15 The Singular Value Decomposition (SVD)
The Singular Value Decomposition (SVD) is one of the most important concepts in applied mathematics. It is used in a number of applications, including dimension reduction and data analysis. Principal Components Analysis (PCA) is a special case of the SVD. Let’s start with the formal definition, and then see how PCA relates to that definition.
Definition 15.1 (Singular Value Decomposition) For any \(m\times n\) matrix \(\A\) with \(rank(\A)=r\), there are orthogonal matrices \(\U_{m\times m}\) and \(\V_{n\times n}\) and a diagonal matrix \(\D_{r\times r}=diag(\sigma_1,\sigma_2,\dots,\sigma_r)\) such that
\[\begin{equation}
\tag{15.1}
\A = \U \underbrace{\pm \D & \bo{0} \\\bo{0}&\bo{0} \mp}_{\text{$m\times n$}} \V^T \quad \mbox{with}\quad \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0
\end{equation}\]
The \(\sigma_i\)’s are called the nonzero singular values of \(\A\). When \(r<p=\min\{m,n\}\) (i.e., when \(\A\) is not full rank), \(\A\) is said to have an additional \(p-r\) zero singular values. This factorization is called a singular value decomposition of \(\A\), and the columns of \(\U\) and \(\V\) are called the left- and right-hand singular vectors for \(\A\), respectively.
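As a quick illustration of this definition, here is a minimal sketch in Python/NumPy (not part of the text’s development; the matrix below is arbitrary example data) showing the shapes of the factors and the reconstruction in Equation (15.1):

```python
# A minimal sketch of Definition 15.1 using NumPy; the matrix A is arbitrary example data.
import numpy as np

A = np.array([[3., 1., 1.],
              [-1., 3., 1.]])          # a 2x3 matrix, so m = 2, n = 3

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 2x2, Vt is 3x3, s holds the singular values

# Rebuild the m x n block matrix [D 0; 0 0] from the singular values.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)

print(U.shape, Sigma.shape, Vt.shape)  # (2, 2) (2, 3) (3, 3)
print(np.allclose(A, U @ Sigma @ Vt))  # True: A = U [D 0; 0 0] V^T
print(np.all(np.diff(s) <= 0))         # True: singular values come out in decreasing order
```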
Properties of the SVD
1. The left-hand singular vectors are a set of orthonormal eigenvectors for \(\A\A^T\).
2. The right-hand singular vectors are a set of orthonormal eigenvectors for \(\A^T\A\).
3. The nonzero singular values are the square roots of the nonzero eigenvalues of \(\A^T\A\) and \(\A\A^T\); these two matrices share the same nonzero eigenvalues.
4. The first singular value is equal to the matrix two-norm:
\[\sigma_1 = \max_{\|\x\|=1} \|\A\x\|_2 = \|\A\|_2\]
5. The Frobenius norm of the matrix is also related to the singular values:
\[\|\A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\sum_{i=1}^r \sigma_i^2}\]
6. Singular values represent distances to lower rank matrices.
\[\sigma_{k+1}=\min_{rank(\bo{B})=k} \|\A-\bo{B}\|_2\]
7. The truncated SVD (Equation (15.3)) provides the closest rank-\(k\) approximation to our original matrix with respect to the 2-norm (and also the Frobenius norm); a quick numerical check of these properties is sketched below.
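These properties are easy to verify numerically. The following is a minimal sketch, assuming NumPy and an arbitrary example matrix (none of the names below come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))        # arbitrary full-rank 5x3 example

U, s, Vt = np.linalg.svd(A)

# Properties 1-3: the squared singular values are the (nonzero) eigenvalues of A^T A.
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eig_AtA))                                    # True

# Property 4: sigma_1 equals the matrix 2-norm.
print(np.isclose(s[0], np.linalg.norm(A, 2)))                        # True

# Property 5: the Frobenius norm is the root sum of squared singular values.
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))   # True

# Property 6: sigma_{k+1} is the 2-norm distance to the best rank-k approximation.
k = 1
B = s[0] * np.outer(U[:, 0], Vt[0, :])     # best rank-1 approximation
print(np.isclose(np.linalg.norm(A - B, 2), s[1]))                    # True
```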
When we studied PCA, one of the goals was to find the new coordinates, or scores, of the data in the principal components basis. If our original (centered or standardized) data was contained in the matrix \(\X\) and the eigenvectors of the covariance/correlation matrix (\(\X^T\X\)) were the columns of a matrix \(\V\), then to find the scores (call these \(\mathbf{S}\)) of the observations on those eigenvectors we used the following equation (which is the transpose of Equation (13.3)): \[\X=\mathbf{S}\V^T.\] This equation mimics Equation (15.1) because the matrix \(\V^T\) in Equation (15.1) is also a matrix of eigenvectors of \(\X^T\X\). This means that the principal component scores \(\mathbf{S}\) are the unit eigenvectors of \(\X\X^T\) (the columns of \(\U\)) scaled by the singular values in \(\D\): \[\mathbf{S}=\U \pm \D & \bo{0} \\ \bo{0}&\bo{0} \mp .\]
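As a sketch of this connection (again using NumPy and made-up data; the variable names are not from the text), the scores computed by projecting \(\X\) onto the columns of \(\V\) match the left singular vectors scaled by the singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))
X = X - X.mean(axis=0)                 # center the data, as in PCA

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # economy SVD: the zero blocks are dropped

# PCA scores two ways: project X onto the eigenvectors of X^T X (the columns of V),
# or scale the left singular vectors by the singular values.
scores_via_projection = X @ Vt.T
scores_via_svd = U * s                 # multiplies column i of U by sigma_i

print(np.allclose(scores_via_projection, scores_via_svd))   # True
```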
15.1 Resolving a Matrix into Components
One of the primary goals of the singular value decomposition is to resolve the data in \(\A\) into \(r\) mutually orthogonal components by writing the matrix factorization as a sum of outer products using the corresponding columns of \(\U\) and rows of \(\V^T\):
\[\A = \U \pm \D & \bo{0} \\\bo{0}&\bo{0} \mp\V^T = \pm \u_1 & \u_2 & \dots &\u_m \mp \pm \sigma_1 & 0 & \dots & 0 & 0 \\ 0 & \ddots & 0 & \vdots & 0 \\ \vdots & 0& \sigma_r & 0 & \vdots \\ 0 & 0 & 0 & 0 &0\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 &0 \mp \pm \v_1^T \\ \v_2^T \\ \vdots \\ \v_n^T \mp\]
\[= \sigma_1\u_1\v_1^T+\sigma_2\u_2\v_2^T+\dots+\sigma_r\u_r\v_r^T, \quad \mbox{where} \quad \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r.\] For simplicity, let \(\Z_i=\u_i\v_i^T\) act as basis matrices for this expansion, so we have \[\begin{equation} \tag{15.2} \A=\sum_{i=1}^r \sigma_i \Z_i. \end{equation}\]
This representation can be regarded as a Fourier expansion. The coefficient (singular value) \(\sigma_i\) can be interpreted as the proportion of \(\A\) lying in the “direction” of \(\Z_i\). When \(\sigma_i\) is small, omitting that term from the expansion will cause only a small amount of the information in \(\A\) to be lost. This fact has important consequences for compression and noise reduction.
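A minimal sketch of the expansion in Equation (15.2), assuming NumPy and an arbitrary example matrix, confirms that the rank-one components \(\Z_i\) rebuild \(\A\) exactly and are mutually orthogonal:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))        # arbitrary example matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Build A as a sum of rank-one components sigma_i * u_i v_i^T (Equation (15.2)).
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))       # True: the r components recover A exactly

# Each component Z_i = u_i v_i^T has unit Frobenius norm, and distinct components
# are orthogonal in the sense that trace(Z_i^T Z_j) = 0 for i != j.
Z = [np.outer(U[:, i], Vt[i, :]) for i in range(len(s))]
print(np.isclose(np.linalg.norm(Z[0], 'fro'), 1.0))          # True
print(np.isclose(np.trace(Z[0].T @ Z[1]), 0.0))              # True
```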
15.2 Data Compression
We’ve already seen how PCA can be used to reduce the dimensions of our data while retaining as much variance as possible. This is done by simply ignoring those components for which the proportion of variance explained is small. Supposing we keep \(k\) principal components, this amounts to truncating the sum in Equation (15.2) after \(k\) terms: \[\begin{equation} \tag{15.3} \A \approx \sum_{i=1}^{k} \sigma_i \Z_i. \end{equation}\] As it turns out, this truncation has important consequences in many applications. One example is image compression. An image is simply an array of pixels. Supposing the image is \(m\) pixels tall by \(n\) pixels wide, we can capture this information in an \(m\times n\) matrix if the image is in grayscale, or an \(m\times 3n\) matrix for an [r,g,b] color image (we’d need 3 values for each pixel to recreate its color). These matrices can get very large (a 6-megapixel photo contains 6 million pixels).
Rather than store the entire matrix, we can store an approximation to it using only a few (well, more than a few) singular values and singular vectors.
This is the basis of image compression. An approximated photo will not be as crisp as the original - some information will be lost - but most of the time we can store far less data than the original matrix requires and still obtain a good depiction of the image.
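The storage trade-off can be sketched as follows, with a random matrix standing in for a grayscale image (a real photograph typically has rapidly decaying singular values, so its reconstruction error at a given \(k\) would be far smaller than for this random stand-in; none of the sizes below come from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
image = rng.random((480, 640))          # stand-in for a 480x640 grayscale image

U, s, Vt = np.linalg.svd(image, full_matrices=False)

k = 50                                  # keep only the first k components
approx = (U[:, :k] * s[:k]) @ Vt[:k, :] # truncated SVD, Equation (15.3)

original_storage = image.size                            # 480*640 = 307,200 numbers
compressed_storage = k * (U.shape[0] + Vt.shape[1] + 1)  # k*(m + n + 1) = 56,050 numbers
relative_error = np.linalg.norm(image - approx, 'fro') / np.linalg.norm(image, 'fro')

print(original_storage, compressed_storage)
print(round(relative_error, 3))         # reconstruction error as a fraction of ||A||_F
```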
15.3 Noise Reduction
Many applications arise where the relevant information contained in a matrix is contaminated by a certain level of noise. This is particularly common with video and audio signals, but also arises in text data and other types of (usually high dimensional) data. The truncated SVD (Equation (15.3)) can actually reduce the amount of noise in data and increase the overall signal-to-noise ratio under certain conditions.
Let’s suppose, for instance, that our matrix \(\A_{m\times n}\) contains data which is contaminated by noise. If that noise is assumed to be random (or nondirectional) in the sense that it is distributed more or less uniformly across the components \(\Z_i\), then there is just as much noise “in the direction” of one \(\Z_i\) as there is in any other. If the amount of noise along each direction is approximately the same, and the \(\sigma_i\)’s tell us how much (relevant) information in \(\A\) is directed along each component \(\Z_i\), then it must be that the ratio of “signal” (relevant information) to noise is decreasing across the ordered components, since \[\sigma_1 \geq \sigma_2\geq \dots \geq \sigma_r\] implies that the signal is greater in earlier components. So letting \(SNR(\sigma_i\Z_i)\) denote the signal-to-noise ratio of each component, we have \[SNR(\sigma_1\Z_1) \geq SNR(\sigma_2\Z_2)\geq \dots \geq SNR(\sigma_r\Z_r)\]
This explains why the truncated SVD, \[\A \approx \sum_{i=1}^{k} \sigma_i \Z_i \quad \mbox{where}\quad k<r\] can, in many scenarios, filter out some of the noise without losing much of the significant information in \(\A\).
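A minimal sketch of this filtering effect, using NumPy and synthetic data (a known low-rank “signal” plus random noise; the noise level and sizes below are arbitrary choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(4)

# Build a rank-3 "signal" matrix and contaminate it with small random noise.
m, n, true_rank = 100, 50, 3
signal = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
noisy = signal + 0.1 * rng.standard_normal((m, n))

# Truncate the SVD of the noisy matrix at k = true_rank (Equation (15.3)).
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = true_rank
denoised = (U[:, :k] * s[:k]) @ Vt[:k, :]

err_noisy = np.linalg.norm(noisy - signal, 'fro')
err_denoised = np.linalg.norm(denoised - signal, 'fro')
print(err_denoised < err_noisy)         # True: truncation filters out much of the noise
```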