Work in progress
1 Motivating linear algebra
Même le feu est régi par les nombres. ("Even fire is governed by numbers.")
Fourier (1768-1830) studied the transmission of heat using tools that would later be called an eigenvector basis. Why would he say something like this?
To understand eigenvalues we will have to establish a few basics first. In the following sections lowercase letters denote scalars, lowercase boldface letters denote vectors, and uppercase boldface letters denote matrices.
2 Matrices, vectors, and operations
A useful way to look at matrices is through systems of equations. Consider a system with \(m\) equations and \(n\) variables below.
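With coefficients \(a_{ij}\), unknowns \(x_j\), and right-hand sides \(b_i\) (the notation used in the rest of this section), such a system reads
\[
\begin{aligned}
a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n &= b_1, \\
a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n &= b_2, \\
&\ \ \vdots \\
a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n &= b_m.
\end{aligned}
\label{eq:eqn_system}
\]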
We can arrange the coefficients \(a_{ij}\) into a real-valued matrix \(\mathbf{A} \in \mathbb{R}^{m,n}\) with \(m\) rows and \(n\) columns.
Similarly, we can place the \(b_i\) and \(x_j\) in vectors \(\mathbf{b} \in \mathbb{R}^{m}\) and \(\mathbf{x} \in \mathbb{R}^{n}\), respectively.
Using matrix multiplication, we can rewrite the system \((\ref{eq:eqn_system})\) as:
\[
\mathbf{A}\mathbf{x} = \mathbf{b}.
\]
Next, we will consider operations with matrices, starting with the operation that allowed us to rewrite the equation system.
2.1.1 Matrix Multiplication
Multiplying \(\mathbf{A} \in \mathbb{R}^{m,n}\) by \(\mathbf{B} \in \mathbb{R}^{n,p}\) produces \(\mathbf{C} = \mathbf{A}\mathbf{B} \in \mathbb{R}^{m,p}\) (Gellert et al. (1979)).
To compute \(\mathbf{C}\), the elements in the rows of \(\mathbf{A}\) are multiplied with the elements in the columns of \(\mathbf{B}\), and the products are summed,
\[
c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}.
\]
The matrix product exists if the number of \(\mathbf{A}\)'s columns equals the number of \(\mathbf{B}\)'s rows.
When using numpy, use the @ operator for matrix multiplication.
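A minimal sketch in numpy, with small illustrative matrices (values assumed):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3)
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])        # shape (3, 2)

C = A @ B                         # shape (2, 2); c_ij = sum_k a_ik * b_kj
print(C)
```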
2.1.2 Matrix-scalar multiplication
Multiplication of a matrix \(\mathbf{A} \in \mathbb{R}^{m,n}\) with a real number \(s \in \mathbb{R}\) means multiplying every matrix element by the scalar, \((s\mathbf{A})_{ij} = s\, a_{ij}\).
2.1.3 Addition
Two matrices \(\mathbf{A} \in \mathbb{R}^{m,n}\) and \(\mathbf{B} \in \mathbb{R}^{m,n}\) can be added by adding their elements, \((\mathbf{A} + \mathbf{B})_{ij} = a_{ij} + b_{ij}\).
2.1.4 The Transpose
The transpose operation flips a matrix along its diagonal, \((\mathbf{A}^T)_{ij} = a_{ji}\). For example, in \(\mathbb{R}^{2,2}\),
\[
\begin{pmatrix} a & b \\ c & d \end{pmatrix}^T = \begin{pmatrix} a & c \\ b & d \end{pmatrix}.
\]
Before we consider additional operations, let's take a look at a very important matrix.
2.1.5 The identity matrix
The identity matrix \(\mathbf{I}\) has ones on its diagonal and zeros everywhere else. It appears if we multiply two special matrices, as we will see next; multiplying by \(\mathbf{I}\) leaves any matrix or vector unchanged.
2.1.6 Matrix inverse
The inverse matrix \(\mathbf{A}^{-1}\) undoes the effect of \(\mathbf{A}\), or in mathematical notation,
\[
\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}.
\]
The standard process for computing the inverse is Gaussian elimination.
Consider an example, and check the result.
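A minimal numpy sketch, with an illustrative matrix (values assumed here):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # illustrative matrix with det(A) = -2

A_inv = np.linalg.inv(A)          # computed via an LU (Gaussian elimination) factorization
print(A_inv)

# check: A times its inverse should give the identity, up to floating-point error
print(np.allclose(A @ A_inv, np.eye(2)))
```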
The inverse exists for regular matrices. A matrix \(\mathbf{A}\) is regular if \(\det(\mathbf{A}) \neq 0\). We will consider the determinant, or \(\det\) for short, in the next section.
2.1.7 The determinant
The determinant packs a lot of information about a matrix into a single number. When a matrix has a zero determinant, one column is a linear combination of the other columns, and its inverse does not exist. We will also require determinants to find eigenvalues by hand.
Determinants can be computed by hand in two or three dimensions. The two-dimensional case:
\[
\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc.
\]
The determinant of a three-dimensional matrix can be computed by expansion along a row or column, for example along the first row,
\[
\det\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix}
= a \det\begin{pmatrix} e & f \\ h & i \end{pmatrix}
- b \det\begin{pmatrix} d & f \\ g & i \end{pmatrix}
+ c \det\begin{pmatrix} d & e \\ g & h \end{pmatrix}.
\]
This works for any row or column, as long as we respect the alternating sign pattern.
As an example computation, we can use numpy: call np.linalg.det. Determinants exist in \(n\) dimensions as well, but we do not require the general case here.
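A short numpy sketch with an assumed three-by-three matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0],
              [5.0, 6.0, 0.0]])   # illustrative matrix (values assumed)

# Laplace expansion along the first row:
# det(A) = 1*(1*0 - 4*6) - 2*(0*0 - 4*5) + 3*(0*6 - 1*5) = -24 + 40 - 15 = 1
print(np.linalg.det(A))           # approximately 1.0
```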
3 Linear curve fitting
What is the best line through a set of measurements? Consider the points below.
How do we find the line closest to all of these points?
3.1 Problem Formulation
A line has the form \(f(a) = da + c\), with \(c,a,d \in \mathbb{R}\). In matrix language, we could ask for every point to be on the line,
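That is, writing the measurement pairs as \((a_i, b_i)\) for \(i = 1, \dots, m\) (a labeling assumed here), we ask \(d a_i + c = b_i\) for every point:
\[
\begin{pmatrix}
a_1 & 1 \\
a_2 & 1 \\
\vdots & \vdots \\
a_m & 1
\end{pmatrix}
\begin{pmatrix} d \\ c \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}.
\label{eq:line_system}
\]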
We can treat polynomials as vectors, too! The coordinates populate the matrix rows in \(\mathbf{A} \in \mathbb{R}^{m \times 2}\), and the coefficients appear in \(\mathbf{x} \in \mathbb{R}^{2}\), with the points we would like to model in \(\mathbf{b} \in \mathbb{R}^{m}\). The problem now appears in matrix form and can be solved using linear algebra!
3.2 The Pseudoinverse
The inverse only exists for square, \(n\) by \(n\), matrices. A nonsquare \(\mathbf{A}\), such as the one we just saw, requires the pseudoinverse (Strang et al. (2009), Deisenroth et al. (2020)),
\[
\mathbf{A}^{\dagger} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T.
\]
Sometimes solving \(\mathbf{A}\mathbf{x} - \mathbf{b} = 0\) exactly is impossible. The pseudoinverse instead considers the least-squares problem
\[
\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2,
\]
and \(\mathbf{A}^{\dagger} \mathbf{b} = \mathbf{x}\) yields the best-possible fit.
We expect the gradient to vanish at the optimum,
\[
\nabla_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 = 2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b}) = 0
\quad \Rightarrow \quad
\mathbf{x} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b} = \mathbf{A}^{\dagger}\mathbf{b}.
\]
This is not a rigorous proof, but it helps us motivate the pseudoinverse.
3.2.1 Linear regression
The pseudoinverse delivers a best possible fit solution to the problem described by equation system \(\ref{eq:line_system}\). The plot below illustrates the solution.
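A minimal numpy sketch of the fit, with synthetic measurements and matplotlib for the plot (all values assumed for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic noisy measurements around an assumed line b = 2*a + 1
rng = np.random.default_rng(0)
a = np.linspace(0.0, 1.0, 25)
b = 2.0 * a + 1.0 + 0.1 * rng.standard_normal(a.shape)

# build the system from the line model: rows [a_i, 1]
A = np.stack([a, np.ones_like(a)], axis=1)

# least-squares fit via the pseudoinverse
d, c = np.linalg.pinv(A) @ b

plt.scatter(a, b, label="measurements")
plt.plot(a, d * a + c, label="fitted line")
plt.legend()
plt.show()
```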
3.2.2 What about harder problems?
Often lines are not enough to model the underlying dynamics of a problem. Consider the signal below:
How do we come up with a mathematical model for this problem?
3.2.3 Fitting higher order polynomials
A possible solution is to move from a line model to a polynomial model. We can set up an equation system with monomials like \(x^n\), as shown below.
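Writing the sample coordinates as \(a_i\), as in the line case, and the polynomial coefficients as \(x_0, \dots, x_n\) (a labeling assumed here), the system takes the form
\[
\begin{pmatrix}
a_1^{n} & a_1^{n-1} & \cdots & a_1 & 1 \\
a_2^{n} & a_2^{n-1} & \cdots & a_2 & 1 \\
\vdots  &           &        &     & \vdots \\
a_m^{n} & a_m^{n-1} & \cdots & a_m & 1
\end{pmatrix}
\begin{pmatrix} x_n \\ x_{n-1} \\ \vdots \\ x_1 \\ x_0 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}.
\]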
As we saw for linear regression, \(\mathbf{A}^{\dagger}\mathbf{b} = \mathbf{x}\) gives us the coefficients.
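A numpy sketch, again with an assumed noisy signal:

```python
import numpy as np

# assumed noisy signal (for illustration only)
rng = np.random.default_rng(0)
a = np.linspace(0.0, 1.0, 50)
b = np.sin(2.0 * np.pi * a) + 0.1 * rng.standard_normal(a.shape)

n = 7                             # polynomial degree
A = np.vander(a, n + 1)           # rows [a_i**n, ..., a_i, 1]
coeffs = np.linalg.pinv(A) @ b    # x = A^dagger b
fit = np.polyval(coeffs, a)       # evaluate the fitted polynomial
print(np.round(coeffs, 2))
```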
3.3 Overfitting
The figure below depicts the solution for a polynomial of 7th degree, that is \(n=7\).
We saw how linear algebra lets us fit polynomials to curves. For the 7th-degree polynomial, however, noise partially dominates the model! What now?
3.4 Regularization
Is there a way to fix the previous example? To do so we start with a rather peculiar observation.
3.5 Eigenvalues and Eigenvectors
Multiplying a matrix \(\mathbf{A}\) with two particular vectors \(\mathbf{x}_1\) and \(\mathbf{x}_2\), we can observe that \(\mathbf{x}_1\) does not change, while \(\mathbf{x}_2\) is multiplied by two. In other words,
eigenvectors turn multiplication with a matrix into multiplication with a number,
\[
\mathbf{A}\mathbf{x} = \lambda \mathbf{x}.
\]
Subtracting \(\lambda \mathbf{x}\) leads to
\[
(\mathbf{A} - \lambda \mathbf{I})\mathbf{x} = 0.
\]
The interesting solutions are those where \(\mathbf{x} \neq 0\), which means
\[
\det(\mathbf{A} - \lambda \mathbf{I}) = 0.
\]
Why do we want the identity in \(\mathbf{A} - \lambda \mathbf{I}\)? Because we are not allowed to subtract numbers from matrices.
Let's compute the eigenvalues and eigenvectors for the initial example.
The determinant is not useful numerically; software packages use the QR method instead.
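In numpy, a minimal sketch with an assumed matrix whose eigenvalues are two and one:

```python
import numpy as np

# assumed example matrix with eigenvalues 2 and 1 (for illustration)
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])

# np.linalg.eig relies on LAPACK's QR-based solver rather than the determinant
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                # eigenvalues 2 and 1
print(eigenvectors)               # columns are the corresponding eigenvectors
```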
3.5.1 Eigenvalue Decomposition
Eigenvalues let us look into the heart of a square system matrix \(\mathbf{A} \in \mathbb{R}^{n,n}\) (Strang et al. (2009)). If \(\mathbf{A}\) has \(n\) linearly independent eigenvectors, we can write
\[
\mathbf{A} = \mathbf{S} \Lambda \mathbf{S}^{-1},
\]
with \(\mathbf{S} \in \mathbb{C}^{n,n}\) holding the eigenvectors in its columns and the diagonal matrix \(\Lambda \in \mathbb{C}^{n,n}\) holding the eigenvalues.
3.5.2 Singular Value Decomposition
The SVD (Strang et al. (2009)) answers the question: what about a non-square matrix \(\mathbf{A} \in \mathbb{R}^{m,n}\)? The idea is that \(\mathbf{A}^T\mathbf{A}\) and \(\mathbf{A}\mathbf{A}^T\) are square and symmetric, so their eigenvectors exist.
Using the eigenvectors of \(\mathbf{A}^T\mathbf{A}\) and \(\mathbf{A}\mathbf{A}^T\), we construct
\[
\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T,
\]
with \(\mathbf{A} \in \mathbb{R}^{m,n}\), \(\mathbf{U} \in \mathbb{R}^{m,m}\), \(\Sigma \in \mathbb{R}^{m,n}\), and \(\mathbf{V} \in \mathbb{R}^{n,n}\). \(\Sigma\)'s diagonal is filled with the square roots of \(\mathbf{A}^T \mathbf{A}\)'s eigenvalues, the singular values \(\sigma_i\).
3.6 Singular values and matrix inversion
The singular value matrix \(\Sigma\) is a zero-padded diagonal matrix; for \(m > n\),
\[
\Sigma = \begin{pmatrix} \mathrm{diag}(\sigma_1, \dots, \sigma_n) \\ \mathbf{0} \end{pmatrix}.
\]
Inverting the nonzero \(\sigma_i\) and transposing the shape yields the pseudoinverse (Golub and Kahan (1965)),
\[
\mathbf{A}^{\dagger} = \mathbf{V} \Sigma^{\dagger} \mathbf{U}^T,
\]
where \(\Sigma^{\dagger} \in \mathbb{R}^{n,m}\) carries \(1/\sigma_i\) on its diagonal.
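A numpy sketch of this construction, with an assumed non-square matrix:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])                            # assumed 3x2 matrix

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # reduced SVD
A_pinv = Vt.T @ np.diag(1.0 / sigma) @ U.T            # V Sigma^dagger U^T
print(np.allclose(A_pinv, np.linalg.pinv(A)))         # matches numpy's pinv
```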
3.6.1 Regularization via Singular Value Filtering
Originally, we had a problem computing \(\mathbf{A}^{\dagger}\mathbf{b} = \mathbf{x}\): the noise partially took over the solution. To solve it, we compute
\[
\mathbf{x}_{\text{reg}} = \sum_{i} f_i \frac{\mathbf{u}_i^T \mathbf{b}}{\sigma_i} \mathbf{v}_i. \label{eq:reg}
\]
The filter factors are computed using \(f_i = \sigma_i^2 / (\sigma_i^2 + \epsilon)\). Small singular values, with \(\sigma_i^2 \ll \epsilon\), are effectively filtered out. Expressing equation \(\ref{eq:reg}\) in matrix notation:
\[
\mathbf{x}_{\text{reg}} = \mathbf{V} \Sigma^{\dagger} \mathbf{F} \mathbf{U}^T \mathbf{b},
\]
with \(\mathbf{A} \in \mathbb{R}^{m,n}\), \(\mathbf{U} \in \mathbb{R}^{m,m}\), \(\mathbf{V} \in \mathbb{R}^{n,n}\), the diagonal filter matrix \(\mathbf{F} \in \mathbb{R}^{m,m}\) carrying the \(f_i\) on its diagonal, \(\Sigma^{\dagger} \in \mathbb{R}^{n,m}\), and \(\mathbf{b} \in \mathbb{R}^{m}\).
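A numpy sketch of the filtered solution, with an assumed noisy signal and an assumed \(\epsilon\):

```python
import numpy as np

# assumed noisy signal (values for illustration only)
rng = np.random.default_rng(0)
a = np.linspace(0.0, 1.0, 50)
b = np.sin(2.0 * np.pi * a) + 0.1 * rng.standard_normal(a.shape)

A = np.vander(a, 8)                                   # 7th-degree polynomial design matrix
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # reduced SVD

eps = 1e-6                                            # assumed filter parameter
f = sigma**2 / (sigma**2 + eps)                       # filter factors f_i
x_reg = Vt.T @ (f / sigma * (U.T @ b))                # sum_i f_i (u_i^T b / sigma_i) v_i
x_plain = Vt.T @ ((U.T @ b) / sigma)                  # unfiltered pseudoinverse solution
print(np.round(x_reg, 2))
print(np.round(x_plain, 2))
```

For \(\epsilon \to 0\) the filtered solution approaches the plain pseudoinverse solution.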
3.6.2 Regularized solution
The plot below illustrates the process for various filter values.
The green line, for example, with \(\epsilon = 10^{-6}\), looks like a good fit!
3.7 Conclusion
True scientists know what linear algebra can do for them! Think about matrix shapes: if you are solving a problem, rule out all formulations where the shapes don't work. Regularization using the SVD as described above is also known as Tikhonov regularization. We studied the process for polynomials here, but the intuition carries over to large neural networks. The cost of fitting polynomials is vastly lower, which is why we started here.
4 Bibliography
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for machine learning. Cambridge University Press, 2020.
W. Gellert, H. Küstner, M. Hellwich, and H. Kästner. Kleine Enzyklopädie Mathematik. VEB Bibliographisches Institut Leipzig, 1979.
Gene Golub and William Kahan. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 2(2):205–224, 1965.
Gilbert Strang. Introduction to linear algebra. Volume 4. Wellesley-Cambridge Press, Wellesley, MA, 2009.