Linear Algebra

Work in progress

1   Motivating linear algebra

Même le feu est régi par les nombres. ("Even fire is governed by numbers.")

Fourier (1768-1830) studied the transmission of heat using tools that would later be called an eigenvector basis. Why would he say something like this?

To understand eigenvalues we will have to establish a few basics first. In the following sections, lowercase letters denote scalars, lowercase boldface letters denote vectors, and uppercase boldface letters denote matrices.

2   Matrices, vectors, and operations

A useful way to look at matrices is through systems of equations. Consider the system with \(m\) equations and \(n\) variables below,

\begin{equation} a_{11} x_1 + \dots + a_{1n} x_n = b_1 \\ \label{eq:eqn_system} a_{21} x_1 + \dots + a_{2n} x_n = b_2 \\ \vdots \\ a_{m1} x_1 + \dots + a_{mn} x_n = b_m \\ \end{equation}

We can arrange the coefficients \(a_{ij}\) into a real-valued matrix \(\mathbf{A} \in \mathbb{R}^{m,n}\) with \(m\) rows and \(n\) columns.

\begin{equation} \mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{pmatrix} \end{equation}

Similarly we can place the \(b\)s and \(x\)s in vectors \(\mathbf{b} \in \mathbb{R}^{m}\) and \(\mathbf{x} \in \mathbb{R}^{n}\), respectively.
Using matrix multiplication, we can rewrite the system \((\ref{eq:eqn_system})\) as:

\begin{equation} \mathbf{A} \mathbf{x} = \mathbf{b}. \end{equation}

Next we will consider operations on matrices, starting with the operation that allowed us to rewrite the equation system: matrix multiplication.

2.1.1   Matrix Multiplication

Multiplying \(\mathbf{A} \in \mathbb{R}^{m,n}\) by \(\mathbf{B} \in \mathbb{R}^{n,p}\) produces \(\mathbf{C} \in \mathbb{R}^{m,p}\) (Gellert et al. (1979)),

\begin{equation} \mathbf{A} \mathbf{B} = \mathbf{C}. \end{equation}

To compute \(\mathbf{C}\), the elements in the rows of \(\mathbf{A}\) are multiplied with the column elements of \(\mathbf{B}\) and the products added,

\begin{equation} c_{ik} = \sum_{j=1}^{n} a_{ij} \cdot b_{jk}. \end{equation}

The matrix product exists if the number of \(\mathbf{A}\)'s columns equals the number of \(\mathbf{B}\)'s rows. In numpy, the @ operator performs matrix multiplication.
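As a quick numpy sketch (the matrices here are made up for illustration):

```python
import numpy as np

# A is (2, 3) and B is (3, 2): A's number of columns equals B's number of rows,
# so the product exists and has shape (2, 2).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

C = A @ B
print(C)  # [[ 4.  5.]
          #  [10. 11.]]
```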

2.1.2   Matrix-scalar multiplication

Multiplication of a matrix \(\mathbf{A} \in \mathbb{R}^{m,n}\) with a real number \(s \in \mathbb{R}\) means multiplying every matrix element with the scalar,

\begin{equation} s \mathbf{A} = \begin{pmatrix} s a_{11} & s a_{12} & \dots & s a_{1n} \\ s a_{21} & s a_{22} & \dots & s a_{2n} \\ \vdots & \vdots & & \vdots \\ s a_{m1} & s a_{m2} & \dots & s a_{mn} \end{pmatrix} . \end{equation}

2.1.3   Addition

Two matrices \(\mathbf{A} \in \mathbb{R}^{m,n}\) and \(\mathbf{B} \in \mathbb{R}^{m,n}\) can be added by adding their elements

\begin{equation} \mathbf{A} + \mathbf{B} = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \dots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \dots & a_{2n} + b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \dots & a_{mn} + b_{mn} \end{pmatrix}. \end{equation}

2.1.4   The Transpose

The transpose operation flips matrices along the diagonal, for example, in \(\mathbb{R}^{2,2}\),

\begin{equation} \begin{pmatrix} a & b \\ c & d \\ \end{pmatrix}^T = \begin{pmatrix} a & c \\ b & d \\ \end{pmatrix} \end{equation}
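A small numpy sketch of these three operations, using made-up matrices:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

print(2.0 * A)  # scalar multiplication: every entry is doubled
print(A + B)    # addition: matrices of equal shape are added elementwise
print(A.T)      # transpose: rows and columns are swapped
```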

Before we consider additional operations, let's take a look at a very important matrix.

2.1.5   The identity matrix

\begin{equation} \mathbf{I} = \begin{pmatrix} 1 & & & \\ &1& & \\ & & \ddots &\\ & & &1\\ \end{pmatrix} \end{equation}

Multiplying any matrix or vector by \(\mathbf{I}\) leaves it unchanged. The identity also appears if we multiply two special matrices,

\begin{equation} \begin{pmatrix} -1 & 0 & 0 \\ 1 &-1 & 1 \\ 1 & 0 &-1 \end{pmatrix} \begin{pmatrix} -1 & 0 & 0 \\ -2 &-1 &-1 \\ -1 & 0 &-1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \end{equation}

2.1.6   Matrix inverse

The inverse matrix \(\mathbf{A}^{-1}\) undoes the effect of \(\mathbf{A}\), or in mathematical notation,

\begin{equation} \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}. \end{equation}

The inverse can be computed by Gaussian elimination.

Consider the example:

\begin{equation} \mathbf{A} = \begin{pmatrix} 2 & 0 \\ 1 & 3 \end{pmatrix} \rightarrow \begin{pmatrix} 2 & 0 & | & 1 & 0 \\ 1 & 3 & | & 0 & 1\\ \end{pmatrix} \rightarrow \begin{pmatrix} 1 & 0 & | & \frac{1}{2} & 0 \\ 1 & 3 & | & 0 & 1\\ \end{pmatrix} \\ \rightarrow \begin{pmatrix} 1 & 0 & | & \frac{1}{2} & 0 \\ 0 & 3 & | & -\frac{1}{2} & 1\\ \end{pmatrix} \rightarrow \begin{pmatrix} 1 & 0 & | & \frac{1}{2} & 0 \\ 0 & 1 & | & -\frac{1}{6} & \frac{1}{3}\\ \end{pmatrix} \end{equation}

We can check the result:

\begin{equation} \begin{pmatrix} 2 & 0 \\ 1 & 3 \\ \end{pmatrix} \begin{pmatrix} \frac{1}{2} & 0 \\ -\frac{1}{6} & \frac{1}{3}\\ \end{pmatrix} = \begin{pmatrix} 2 \cdot \frac{1}{2} + 0 \cdot (-\frac{1}{6}) & 2 \cdot 0 + 0 \cdot \frac{1}{3} \\ 1 \cdot \frac{1}{2} + 3 \cdot (-\frac{1}{6}) & 1 \cdot 0 + 3 \cdot \frac{1}{3} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ \end{pmatrix} \end{equation}
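The same check in numpy, as a quick sketch:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])

A_inv = np.linalg.inv(A)
print(A_inv)      # approximately [[0.5, 0.0], [-0.1667, 0.3333]]
print(A @ A_inv)  # the identity matrix, up to floating-point rounding
```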

The inverse exists only for regular matrices. A matrix \(\mathbf{A}\) is regular if \(\det(\mathbf{A}) \neq 0\). We will briefly consider the determinant, or \(\det\), in the next section.

2.1.7   The determinant

The determinant condenses a lot of information about a matrix into a single number. When a matrix has a zero determinant, one of its columns is a linear combination of the other columns, and its inverse does not exist. We also require determinants to find eigenvalues by hand.

Determinants can be computed by hand in two or three dimensions. The two-dimensional case:

\begin{equation} \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ \end{vmatrix} = a_{11} \cdot a_{22} - a_{12} \cdot a_{21} \\ \end{equation}

The determinant of a three-dimensional matrix can be computed by expanding along the first column:

\begin{equation} \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11} \cdot \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \\ \end{vmatrix} - a_{21} \cdot \begin{vmatrix} a_{12} & a_{13} \\ a_{32} & a_{33} \\ \end{vmatrix} + a_{31} \cdot \begin{vmatrix} a_{12} & a_{13} \\ a_{22} & a_{23} \\ \end{vmatrix} \end{equation}

This expansion works along any row or column, as long as we respect the sign pattern.

Example computation:

\begin{equation} \begin{vmatrix} -1 & 0 & 0 \\ 1 &-1 & 1 \\ 1 & 0 &-1 \end{vmatrix} = -1 \cdot \begin{vmatrix} -1 & 1 \\ 0 & -1 \\ \end{vmatrix} - 1 \cdot \begin{vmatrix} 0 & 0 \\ 0 & -1 \\ \end{vmatrix} + 1 \cdot \begin{vmatrix} 0 & 0 \\ -1 & 1 \\ \end{vmatrix} \\ = (-1) \cdot ((-1) \cdot (-1) - 1 \cdot 0) - 1 \cdot (0 \cdot (-1) - 0 \cdot 0) + 1 \cdot (0 \cdot 1 - 0 \cdot (-1)) \\ = -1 \end{equation}

In numpy, np.linalg.det computes the determinant. Determinants can be computed in \(n\) dimensions as well, but we do not require this here.
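For example, checking the computation above:

```python
import numpy as np

A = np.array([[-1.0,  0.0,  0.0],
              [ 1.0, -1.0,  1.0],
              [ 1.0,  0.0, -1.0]])

print(np.linalg.det(A))  # approximately -1.0
```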

3   Linear curve fitting

What is the best line connecting a set of measurements? Consider the points below:

[Figure: a line through a noisy signal]

How do we find the line closest to all of these points?

3.1   Problem Formulation

A line has the form \(f(a) = da + c\), with \(c,a,d \in \mathbb{R}\). In matrix language, we could ask for every point to be on the line,

\begin{equation} \label{eq:line_system} \begin{pmatrix} 1 & a_1 \\ 1 & a_2 \\ 1 & a_3 \\ \vdots & \vdots \\ 1 & a_m \\ \end{pmatrix} \begin{pmatrix} c \\ d \\ \end{pmatrix} = \begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ \vdots \\ p_m \end{pmatrix}. \end{equation}

We can treat polynomials as vectors, too! The measurement coordinates populate the rows of \(\mathbf{A} \in \mathbb{R}^{m,2}\), the line coefficients appear in \(\mathbf{x} \in \mathbb{R}^{2}\), and the points we would like to model sit in \(\mathbf{b} \in \mathbb{R}^{m}\). The problem now appears in matrix form and can be solved using linear algebra!
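As a sketch of this setup in numpy (the measurements here are synthetic, generated only for illustration):

```python
import numpy as np

# synthetic noisy measurements (a_i, p_i) scattered around the line p = 2a + 1
rng = np.random.default_rng(0)
a = np.linspace(0.0, 1.0, 20)
p = 2.0 * a + 1.0 + 0.1 * rng.standard_normal(a.shape)

# one row (1, a_i) per measurement
A = np.stack([np.ones_like(a), a], axis=1)  # shape (m, 2)
b = p                                       # shape (m,)
```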

3.2   The Pseudoinverse

The inverse exists only for square, that is \(n\) by \(n\), matrices. Nonsquare matrices \(\mathbf{A}\), such as the one we just saw, require the pseudoinverse (Strang (2009), Deisenroth et al. (2020)),

\begin{equation} \mathbf{A}^{\dagger} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T . \end{equation}

Sometimes solving \(\mathbf{A}\mathbf{x} - \mathbf{b} = 0\) exactly is impossible. In that case, the pseudoinverse does the best it can and considers

\begin{equation} \min_x \dfrac{1}{2}|\mathbf{A}\mathbf{x} - \mathbf{b}|^2 \\ \end{equation}

instead. \(\mathbf{A}^{\dagger} \mathbf{b} = \mathbf{x}\) yields this best-fit solution. Let's see why.

We expect the gradient to vanish at the optimum,

\begin{equation} 0 = \nabla_x \dfrac{1}{2}|\mathbf{A}\mathbf{x} - \mathbf{b}|^2 \end{equation}
\begin{equation} = \nabla_x \dfrac{1}{2}(\mathbf{A}\mathbf{x} - \mathbf{b})^T(\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T\mathbf{A}\mathbf{x} - \mathbf{A}^T\mathbf{b} \end{equation}
\begin{equation} \rightarrow \mathbf{A}^T\mathbf{b} = \mathbf{A}^T\mathbf{A}\mathbf{x} \end{equation}
\begin{equation} (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b} = \mathbf{x} \end{equation}

This is not a rigorous proof, but it helps us motivate the pseudoinverse.
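Continuing the sketch from above, the least-squares line coefficients can be computed as follows (np.linalg.lstsq is shown as an equivalent, numerically preferable route):

```python
import numpy as np

# A and b come from the line-fitting sketch above
x = np.linalg.pinv(A) @ b                        # pseudoinverse solution
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # equivalent least-squares solve

c, d = x  # intercept and slope of the fitted line
print(c, d)
```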

3.2.1   Linear regression

The pseudoinverse delivers the best-fit solution to the problem described by equation system \(\ref{eq:line_system}\). The plot below illustrates the solution.

[Figure: regression line through the noisy measurements]

3.2.2   What about harder problems?

Often lines are not enough to model the underlying dynamics of a problem. Consider the signal below:

[Figure: a more complex signal]

How do we come up with a mathematical model for this problem?

3.2.3   Fitting higher order polynomials

A possible solution is to move from a line to a polynomial model. We can set up the equation system below with monomials up to degree \(n\),

\begin{equation} \underbrace{ \begin{pmatrix} 1 & a_1^1 & a_1^2 & \dots & a_1^n \\ 1 & a_2^1 & a_2^2 & \dots & a_2^n \\ 1 & a_3^1 & a_3^2 & \dots & a_3^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & a_m^1 & a_m^2 & \dots & a_m^n \\ \end{pmatrix} }_{\mathbf{A}} \underbrace{ \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_{n+1} \end{pmatrix} }_{\mathbf{x}} = \underbrace{ \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{pmatrix} }_{\mathbf{b}} . \end{equation}

As we saw for linear regression, \(\mathbf{A}^{\dagger}\mathbf{b} = \mathbf{x}\) gives us the coefficients.
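A short sketch, reusing the synthetic a and b from the line-fitting example (the degree is chosen here purely for illustration):

```python
import numpy as np

n = 7  # polynomial degree, an illustrative choice
# columns a^0, a^1, ..., a^n, one row per measurement
A_poly = np.vander(a, N=n + 1, increasing=True)
coeffs = np.linalg.pinv(A_poly) @ b  # least-squares polynomial coefficients
fit = A_poly @ coeffs                # model values at the measurement points
```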

3.3   Overfitting

The figure below depicts the solution for a polynomial of 7th degree, that is \(n=7\).

[Figure: a 7th-degree polynomial overfitting the noisy signal]

We saw how linear algebra lets us fit polynomials to data. For the 7th-degree polynomial, noise partially dominates the model! What now?

3.4   Regularization

Is there a way to fix the previous example? To do so we start with a rather peculiar observation.

3.5   Eigenvalues and Eigenvectors

Multiply matrix \(\mathbf{A}\) with vectors \(\mathbf{x_1}\) and \(\mathbf{x_2}\),

\begin{equation} \mathbf{A} = \begin{pmatrix} 1 & 4 \\ 0 & 2 \end{pmatrix}, \mathbf{x_1} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \mathbf{x_2} = \begin{pmatrix} 4 \\ 1 \end{pmatrix}, \end{equation}

we observe

\begin{equation} \mathbf{A}\mathbf{x_1} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \mathbf{A}\mathbf{x_2} = \begin{pmatrix} 8 \\ 2 \end{pmatrix} \end{equation}

Vector \(\mathbf{x_1}\) has not changed! Vector \(\mathbf{x_2}\) was multiplied by two. In other words,

\begin{equation} \mathbf{A}\mathbf{x_1} = 1 \mathbf{x_1}, \mathbf{A}\mathbf{x_2} = 2 \mathbf{x_2} \end{equation}

Eigenvectors turn multiplication with a matrix into multiplication with a number,

\begin{equation} \mathbf{A}\mathbf{x} = \lambda \mathbf{x} . \end{equation}

Subtracting \(\lambda \mathbf{x}\) from both sides leads to,

\begin{equation} (\mathbf{A} - \lambda \mathbf{I}) \mathbf{x} = 0 \end{equation}

The interesting solutions are those where \(\mathbf{x} \neq 0\), which means

\begin{equation} \det(\mathbf{A} - \lambda \mathbf{I}) = 0 \end{equation}

Why do we want the identity in \(\mathbf{A} - \lambda \mathbf{I}\)? Because we cannot subtract a number from a matrix directly; \(\lambda \mathbf{I}\) subtracts \(\lambda\) along the diagonal.

Let's compute the eigenvalues and eigenvectors for the initial example.

\begin{equation} \mathbf{A} = \begin{pmatrix} 1 & 4 \\ 0 & 2 \end{pmatrix} \rightarrow \begin{vmatrix} 1- \lambda & 4 \\ 0 & 2 - \lambda \end{vmatrix} = (1-\lambda)\cdot(2-\lambda) - 0 \cdot 4 = 0 \\ \rightarrow \lambda_1 = 1, \lambda_2 = 2 . \\ \begin{pmatrix} 1 - 1 & 4 \\ 0 & 2 - 1 \end{pmatrix} \mathbf{x}_1 = \begin{pmatrix} 0 & 4 \\ 0 & 1 \end{pmatrix} \mathbf{x}_1 = 0 \rightarrow \mathbf{x}_1 = \begin{pmatrix} p \\ 0 \end{pmatrix} \text{ for } p \in \mathbb{R} \\ \begin{pmatrix} 1 - 2 & 4 \\ 0 & 2 - 2 \end{pmatrix} \mathbf{x}_2 = \begin{pmatrix} -1 & 4 \\ 0 & 0 \end{pmatrix} \mathbf{x}_2 = 0 \rightarrow \mathbf{x}_2 = \begin{pmatrix} q \\ \frac{1}{4}q \end{pmatrix} \text{ for } q \in \mathbb{R} \end{equation}

The determinant approach is not useful numerically; software packages compute eigenvalues with the QR method instead.
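In numpy, np.linalg.eig returns both eigenvalues and eigenvectors; a quick check of the example above:

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [0.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # [1. 2.] (order may vary)
print(eigenvectors)  # columns are normalized eigenvectors,
                     # proportional to (1, 0) and (4, 1)
```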

3.5.1   Eigenvalue Decomposition

Eigenvalues let us look into the heart of a square system matrix \(\mathbf{A} \in \mathbb{R}^{n,n}\) (Strang (2009)).

\begin{equation} \mathbf{A} = \mathbf{S}\begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \\ \end{pmatrix} \mathbf{S}^{-1} =\mathbf{S}\Lambda \mathbf{S}^{-1}, \end{equation}

with \(\mathbf{S} \in \mathbb{C}^{n,n}\) and \(\Lambda \in \mathbb{C}^{n,n}\). The columns of \(\mathbf{S}\) are the eigenvectors of \(\mathbf{A}\), and \(\Lambda\) holds the corresponding eigenvalues on its diagonal.

3.5.2   Singular Value Decomposition

What about a non-square matrix \(\mathbf{A} \in \mathbb{R}^{m,n}\)? The singular value decomposition (SVD) (Strang (2009)) handles this case. The idea:

\begin{equation} \mathbf{A^T}\mathbf{A} = \mathbf{V} \begin{pmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{pmatrix} \mathbf{V}^{-1}, \mathbf{A}\mathbf{A^T} = \mathbf{U} \begin{pmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_m^2 \end{pmatrix} \mathbf{U}^{-1}. \end{equation}

Using the eigenvectors of \(\mathbf{A}^T\mathbf{A}\) and \(\mathbf{A}\mathbf{A}^T\) we construct,

\begin{equation} \mathbf{A} = \mathbf{U}\Sigma \mathbf{V}^T, \end{equation}

with \(\mathbf{A} \in \mathbb{R}^{m,n}\), \(\mathbf{U} \in \mathbb{R}^{m,m}\), \(\Sigma \in \mathbb{R}^{m,n}\) and \(\mathbf{V} \in \mathbb{R}^{n,n}\). \(\Sigma\)'s diagonal is filled with the square roots of \(\mathbf{A}^T \mathbf{A}\)'s eigenvalues.
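A quick numpy sketch with a made-up non-square matrix:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0],
              [0.0, 1.0]])  # shape (3, 2), so m = 3 and n = 2

U, s, Vt = np.linalg.svd(A)        # s holds the singular values sigma_i
print(U.shape, s.shape, Vt.shape)  # (3, 3) (2,) (2, 2)
```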

3.6   Singular values and matrix inversion

The singular value matrix is a zero-padded diagonal matrix

\begin{equation} \mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T = \mathbf{U} \left( \begin{array}{c c c} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \\ \hline & 0 & \\ \end{array} \right) \mathbf{V}^T . \end{equation}

Inverting the sigmas and transposing yields the pseudoinverse (Golub and Kahan (1965)).

\begin{equation} \mathbf{A}^{\dagger} = \mathbf{V} \Sigma^\dagger \mathbf{U}^T = \mathbf{V} \left( \begin{array}{c c c} \sigma_1^{-1} & & \\ & \ddots & \\ & & \sigma_n^{-1} \\ \hline & 0 & \\ \end{array} \right)^T \mathbf{U}^T . \end{equation}
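Continuing the SVD sketch above, the pseudoinverse can be assembled from the pieces (this assumes all singular values are nonzero):

```python
import numpy as np

# invert the singular values on the diagonal, then pad to the transposed shape
Sigma_pinv = np.zeros(A.shape[::-1])             # shape (n, m)
Sigma_pinv[:len(s), :len(s)] = np.diag(1.0 / s)  # assumes sigma_i != 0
A_pinv = Vt.T @ Sigma_pinv @ U.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))    # True for this example
```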

3.6.1   Regularization via Singular Value Filtering

Originally, computing \(\mathbf{A}^{\dagger}\mathbf{b} = \mathbf{x}\) directly gave us an overfitted solution. To fix this, we compute instead,

\begin{equation} \label{eq:reg} \mathbf{x}_{reg} = \sum_{i=1}^{n} f_i \frac{\mathbf{u}_i^T \mathbf{b}}{\sigma_i}\mathbf{v}_i \end{equation}

The filter factors are computed as \(f_i = \sigma_i^2 / (\sigma_i^2 + \epsilon)\). Singular values with \(\sigma_i^2\) much smaller than \(\epsilon\) are effectively filtered out. Expressing equation \ref{eq:reg} in matrix notation:

\begin{equation} \mathbf{x}_{reg}= \mathbf{V} \mathbf{F} \Sigma^{\dagger} \mathbf{U}^T \mathbf{b}_{noise} \end{equation}

with \(\mathbf{A} \in \mathbb{R}^{m,n}\), \(\mathbf{U} \in \mathbb{R}^{m,m}\), \(\mathbf{V} \in \mathbb{R}^{n,n}\), diagonal \(\mathbf{F} \in \mathbb{R}^{n,n}\), \(\Sigma^{\dagger} \in \mathbb{R}^{n,m}\) and \(\mathbf{b} \in \mathbb{R}^{m}\). \(\mathbf{F}\) has the \(f_i\) on its diagonal.
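A compact sketch of the filtered solve; the function name and the reuse of the polynomial system from before are illustrative assumptions:

```python
import numpy as np

def filtered_solve(A, b, eps=1e-6):
    """Regularized least squares via singular value filtering."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # s / (s**2 + eps) equals f_i / sigma_i and avoids dividing by zero
    coeffs = (s / (s**2 + eps)) * (U.T @ b)
    return Vt.T @ coeffs  # sum_i f_i (u_i^T b / sigma_i) v_i

# hypothetical usage with the polynomial design matrix from before:
# x_reg = filtered_solve(A_poly, b, eps=1e-6)
```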

3.6.2   Regularized solution

The plot below illustrates the process for various filter values.

[Figure: regularized fits for several filter values]

The green line, for example the one with \(10^{-6}\), looks like a good fit!

3.7   Conclusion

True scientists know what linear algebra can do for them! Think about matrix shapes: if you are solving a problem, rule out all formulations where the shapes don't work. Regularization using the SVD is also known as Tikhonov regularization. We studied the process for polynomials here, but the intuition carries over to large neural networks. The cost of fitting polynomials is vastly lower, which is why we started here.


4   Bibliography

Marc Peter Deisenroth, A Aldo Faisal, and Cheng Soon Ong. Mathematics for machine learning. Cambridge University Press, 2020.

W. Gellert, H. Küstner, M. Hellwich, and H. Kästner. Kleine Enzyklopädie Mathematik. VEB Bibliographisches Institut Leipzig, 1979.

Gene Golub and William Kahan. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 2(2):205–224, 1965.

Gilbert Strang. Introduction to linear algebra. Volume 4. Wellesley-Cambridge Press, Wellesley, MA, 2009.